
[IT0089] MODULE 5 (MAIN) – OVERVIEW OF ASSOCIATION ANALYSIS

WHAT ARE PATTERNS?
- Patterns are sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
- Patterns represent intrinsic and important properties of datasets

WHEN DO YOU USE PATTERN DISCOVERY?
 What products were often purchased together?
 What are the subsequent purchases after buying an iPad?
 What code segments likely contain copy-and-paste bugs?
 What word sequences likely form phrases in this corpus?

CONCEPT OF MARKET BASKET ANALYSIS
- Market basket analysis is like an imaginary basket used by retailers to check the combination of two or more items that customers are likely to buy
- "Two-thirds of what we buy in the supermarket we had no intention of buying." – Paco Underhill, author of Why We Buy: The Science of Shopping

ASSOCIATION RULE IS THE FOUNDATION OF SEVERAL RECOMMENDER SYSTEMS

POSSIBLE ACTIONS BASED ON ASSOCIATION RULE

BARBIE DOLL -> CANDY
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + a poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.

PROCESS OF RULE SELECTION

Generate all rules that meet specified support & confidence:
 Find frequent item sets (those with sufficient support – see above)
 From these item sets, generate rules with sufficient confidence

RULE INTERPRETATION

Lift Ratio
- shows how effective the rule is in finding consequents (useful if finding particular consequents is important)

Confidence
- shows the rate at which consequents will be found (useful in learning the costs of promotion)

Support
- measures overall impact

RULE SET (XLMiner)

CHALLENGES
- Random data can generate apparently interesting association rules
- The more rules you produce, the greater this danger
- Rules based on large numbers of records are less subject to this danger

M5 (ST1 – PATTERN DISCOVERY)

WHAT IS THE ESSENCE OF DATA MINING?

"...the discovery of interesting, unexpected, or valuable structures in large data sets." – David Hand

"If you've got terabytes of data, and you're relying on data mining to find interesting things in there for you, you've lost before you've even begun." – Herb Edelstein

WHAT IS PATTERN DISCOVERY?
- According to SAS, it is one of the broad categories of analytical methods associated with data mining.
- It is a process of uncovering patterns from massive data sets.

PATTERN DISCOVERY CAUTION
 Poor data quality
 Opportunity
 Interventions
 Separability
 Obviousness
 Non-stationarity

APPLICATIONS OF PATTERN DISCOVERY
 Data reduction
 Novelty detection
 Profiling
 Market basket analysis
 Sequence analysis

PATTERN DISCOVERY KEY IDEAS
 Patterns are not known
 But data which are believed to possess patterns are given
 Examples:
   o Clustering: grouping similar samples into clusters
   o Association rule mining: discovering certain features that often appear together in data
 Patterns are known beforehand and are observed/described by:
   o Explicit samples
   o Similar samples (usually)
 Modeling approaches:
   o Build a model for each pattern
   o Find the best-fit model for new data
 Usually require training using observed samples

SEQUENTIAL PATTERN
- Shopping sequences, medical treatments, natural disasters, weblog click streams, program execution sequences, DNA, protein, etc.

PATTERN DISCOVERY TECHNIQUES

1. CLUSTERING
- Clustering aims at dividing the data set into
groups (clusters) where the inter-cluster
similarities are minimized while the
similarities within each cluster are
maximized
2. ASSOCIATION RULE MINING
- Association rule discovery on usage data
results in finding groups of pages that are
commonly accessed
3. SEQUENTIAL PATTERN MINING
4. CLASSIFICATION
5. DECISION TREES
6. NAÏVE BAYES CLASSIFIER

SUPPORT

- The strength of the association is measured by the support and confidence of the rule.
- The support for the rule A => B is the probability that the two item sets occur together.
- Support is estimated using the following:

  support(A => B) = (transactions that contain every item in A and B) / (all transactions)

CONFIDENCE

- defines the likelihood of occurrence of the consequent in the cart, given that the cart already has the antecedent
- Technically, confidence is the conditional probability of occurrence of the consequent given the antecedent.
- Confidence is estimated using the following:

  confidence(A => B) = (transactions containing both A and B) / (transactions containing A)

CALCULATING FOR THE SUPPORT

TRANSACTION ID   ITEMS
Lanz             Bread, Milk
Ong              Bread, Lady's Choice, Eggs
Gonzales         Bread, Milk, Butter
Gepilano         Bread, Diaper, Coke
Francisco        Bread, Diaper, Milk
Tupaz            Bread, Milk, Butter

The support for bread => milk is 4/6.

CALCULATING FOR THE CONFIDENCE
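Both measures can be checked against the table above with a few lines of Python. A minimal sketch (item names are lowercased here for convenience):

```python
# Each basket from the table above, as a set of items.
baskets = [
    {"bread", "milk"},                   # Lanz
    {"bread", "lady's choice", "eggs"},  # Ong
    {"bread", "milk", "butter"},         # Gonzales
    {"bread", "diaper", "coke"},         # Gepilano
    {"bread", "diaper", "milk"},         # Francisco
    {"bread", "milk", "butter"},         # Tupaz
]

def support(items):
    """(transactions containing every item) / (all transactions)"""
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """(transactions containing A and B) / (transactions containing A)"""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 4/6 ≈ 0.67, as computed above
print(confidence({"bread"}, {"milk"}))  # also 4/6, since every basket has bread
```

The confidence of bread => milk equals its support here only because bread appears in all six transactions.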

RULE INTERPRETATION

LIFT RATIO – shows how effective the rule is in finding consequents (useful if finding particular consequents is important)

CONFIDENCE – shows the rate at which consequents will be found (useful in learning the costs of promotion)

SUPPORT – measures overall impact

M5 (ST2 – ASSOCIATION RULE)

ASSOCIATION RULE DISCOVERY
- Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item.

DID YOU KNOW?
- Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.

APPLICATIONS OF ASSOCIATION RULE
 Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together
 Telecommunication: each customer is a transaction containing the set of phone calls
 Credit Cards/Banking Services: each card/account is a transaction containing the set of the customer's payments
 Medical Treatments: each patient is represented as a transaction containing the ordered set of diseases
 Basketball-Game Analysis: each game is represented as a transaction containing the ordered set of ball passes

ADVANTAGES
- Uses the large item set property
- Easily parallelized
- Easy to implement

DISADVANTAGES
- Assumes the transaction database is memory resident
- Requires many database scans

KEY IDEAS
 Item set I: a set of all the items
 Transaction T: a set of items such that T ⊆ I
 Transaction Database D: a set of transactions
 A transaction T ⊆ I contains a set X ⊆ I of some items if X ⊆ T
 An Association Rule is an implication of the form X => Y, where X, Y ⊆ I
 A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
 The support s of an itemset X is the percentage of transactions in the transaction database D that contain X
 The support of the rule X => Y in the transaction database D is the support of the itemset X ∪ Y in D
 The confidence of the rule X => Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D

SAMPLE SCENARIO

Support – the greater the percentage, the better
- How frequently the if-then relationship (X -> Y) appears in the database, i.e., how many times it occurs.
- A 60% likelihood is the chance that the two items go together, i.e., that the combination of the two appears.

Confidence – pertains to how many times this relationship has been found to be true
- This is the occurrence, the chance or likelihood, of item B being bought after taking item A.
- Example: 75% of transactions that buy bread also buy butter.
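Lift can be read directly off the two measures above: it is the rule's confidence divided by the baseline support of the consequent, so a value above 1 means the antecedent genuinely raises the chance of the consequent. A minimal sketch using the 75% bread-and-butter confidence from the example, with an assumed (illustrative, not from the module) 50% butter support:

```python
def lift_ratio(confidence_of_rule, support_of_consequent):
    """Lift = confidence(X -> Y) / support(Y).
    Lift > 1 means X and Y occur together more often than by chance."""
    return confidence_of_rule / support_of_consequent

# 75% of bread buyers also buy butter; butter appears in 50% of all
# transactions (the 50% figure is an assumption for illustration).
print(lift_ratio(0.75, 0.50))  # 1.5 -> the rule is effective at finding butter buyers
```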

VARIABLES IN ASSOCIATION RULE

 GIVEN:
- a set I of all the items;
- a database D of transactions;
- minimum support s;
- minimum confidence c;

 FIND:
- all association rules X => Y with a minimum support s and confidence c.

PROBLEM DECOMPOSITION
 Find all sets of items that have minimum support (frequent itemsets)
 Use the frequent itemsets to generate the desired rules

TERMS

"IF" part = antecedent
"THEN" part = consequent
"Item set" = the items (e.g., products) comprising the antecedent or consequent
 The antecedent and consequent are disjoint (i.e., have no items in common)

EXAMPLE: PHONE FACEPLATES

MANY RULES ARE POSSIBLE

For example, Transaction 1 supports several rules, such as
 "If red, then white" ("If a red faceplate is purchased, then so is a white one")
 "If white, then red"
 "If red and white, then green"
 + several more

EXAMPLE RULE

{red, white} => {green} with confidence = 2/4 = 50%
 [(support {red, white, green}) / (support {red, white})]

{red, green} => {white} with confidence = 2/2 = 100%
 [(support {red, white, green}) / (support {red, green})]

Plus 4 more with confidence of 100%, 33%, 29% & 100%.

If the confidence criterion is 70%, report only rules 2, 3 and 6.

SAMPLE ASSOCIATION PROBLEM

TASK 1: COMPUTE FOR THE SUPPORT
TASK 2: COMPUTE FOR THE CONFIDENCE
TASK 3: COMPUTE FOR THE LIFT RATIO
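The two-phase process from PROBLEM DECOMPOSITION (find the frequent itemsets, then keep only the rules with sufficient confidence) can be sketched in Python. The transactions below are hypothetical faceplate-style rows, since the module's actual faceplate table is an image and is not reproduced here:

```python
from itertools import combinations

# Hypothetical faceplate transactions (illustrative, not the module's table).
transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"white", "blue"},
    {"red", "white", "orange"},
    {"red", "blue"},
    {"white", "blue"},
]
min_support, min_confidence = 0.3, 0.7

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Phase 1: find all frequent itemsets (brute force over every subset size).
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_support]

# Phase 2: from each frequent itemset, keep rules with sufficient confidence.
rules = []
for itemset in frequent:
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(consequent), conf))

for a, c, conf in rules:
    print(f"{a} => {c} (confidence {conf:.0%})")
```

Real implementations (e.g., Apriori) prune candidate itemsets instead of enumerating every subset, but the two phases are the same.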


FORMATIVES

17/20

1. Write the support formula for the following expression: A => B
   [(transactions that contain every item in A and B) / (all transactions)]
2. It is intended to select the "best" subset of predictors. (Use lowercase for your answer) [variable selection]
3. When it comes to association analysis, the more rules you produce, the greater the risk is. [TRUE]
4. Observe the table below and compute for the confidence of beer -> diaper. [NONE OF THE CHOICES] [1/4 OR 0.25]

   TRANSACTION ID   ITEMS
   Trans A          Beer, Peanut, Egg
   Trans B          Beer, Milk, Peanut, Diaper
   Trans C          Milk, Diaper, Egg
   Trans D          Peanut, Egg, Diaper
   Trans E          Beer, Peanut, Egg
   Trans F          Egg, Beer, Peanut

5. Suppose you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired. Which among the following is the ideal value for your alpha? [0.8]
6. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item. [TRUE]
7. Observe the table in no. 4 and compute for the lift ratio of diaper -> milk. [1.5] [2]
8. Observe the table below and compute for the support for beer -> peanut. [4/7]

   TRANSACTION ID   ITEMS
   Trans A          Beer, Peanut, Egg
   Trans B          Beer, Milk, Peanut, Diaper
   Trans C          Milk, Diaper, Egg
   Trans D          Peanut, Egg, Diaper
   Trans E          Beer, Peanut, Egg
   Trans F          Egg, Beer, Peanut
   Trans G          Beer, Diaper, Peanut

9. Which of the following is not an application of a sequential pattern? [IDENTIFYING FAKE NEWS]
10. Which of the following is not an application of pattern discovery? [NONE OF THE CHOICES]
11. It shows how effective the rule is in finding consequents. [LIFT RATIO]
12. A trend series is a sequence of observations on a variable measured at successive points in time or over successive periods of time. [FALSE]
13. Which of the following is not an advantage of using association rule? [ASSUMES TRANSACTION DATABASE IS MEMORY RESIDENT]
14. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. [TRUE]
15. Observe the table in no. 8 and compute for the support for beer -> peanut. [4/7]
16. Observe the table below and compute for the support for SD Card => phone case. [0.3]
17. It measures the overall impact. [SUPPORT]
18. Clustering aims to discover certain features that often appear together in data. [FALSE]
19. Observe the table in no. 8 and compute for the support for diaper -> peanut. [0.43]
20. It is another type of association analysis that involves using sequence data. [ASSOCIATION RULE]

--------------------------------------------------------------------------

17/20

1. Segmentation is a data mining method that usually consists of two variables: a transaction and an item. [FALSE]
2. It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer) [variable clustering]
3. It controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}. [LIFT RATIO]
4. Observe the table below and compute for the support for beer -> peanut. [4/7] [5/7]

   TRANSACTION ID   ITEMS
   Trans A          Beer, Peanut, Egg
   Trans B          Beer, Milk, Peanut, Diaper
   Trans C          Milk, Diaper, Egg
   Trans D          Peanut, Egg, Diaper
   Trans E          Beer, Peanut, Egg
   Trans F          Egg, Beer, Peanut
   Trans G          Beer, Diaper, Peanut

5. Association rule mining is about grouping similar samples into clusters. [FALSE]
6. It is the process of discovering useful patterns and trends in large data sets. (Use lowercase for your answer) [data mining]
7. Observe the table in no. 4 and compute for the support for beer => egg. [0.43]
8. Observe the table in no. 4 and compute for the lift ratio of egg -> peanut. [0.93]
9. Write the support formula for the following expression: A => B
   [(transactions that contain every item in A and B) / (all transactions)]
10. It is the conditional probability of occurrence of the consequent given the antecedent. [CONFIDENCE]
11. Observe the table in no. 4 and compute for the support for milk -> egg. [0.14]
12. It shows how effective the rule is in finding consequents. [LIFT RATIO]
13. Observe the table in no. 4 and compute for the support for diaper -> peanut. [0.43]
14. Observe the table below and compute for the support for airpods => charger. [0.4]
15. Observe the table below and compute for the support for Fries => Burger. [0.6]
16. Input validation helps to lessen what type of anomaly? [INSERTION ANOMALY]
17. Observe the table below and compute for the support for Phone Case => SD Card. [0.5]
18. In association analysis, the values of items can be categoric or numeric. [FALSE] (they should be categorical only)
19. Lift ratio shows how effective the rule is in finding consequents. [TRUE]
20. Observe the following table and compute for the lift ratio of cake -> fries. [0.1]

--------------------------------------------------------------------------

20/20

1. It is intended to select the "best" subset of predictors. [variable selection]
2. Write the support formula for the following expression: A => B
   Answer: (transactions that contain every item in A and B) / (all transactions)
3. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item. [TRUE]
4. Which of the following is not an application of a sequential pattern? [Identifying fake news]
5. Which of the following is not an application of pattern discovery? [None of the choices]
6. It shows how effective the rule is in finding consequents. [Lift ratio]
7. A trend series is a sequence of observations on a variable measured at successive points in time or over successive periods of time. [False] (it should be "time series")
8. Observe the table below and compute for the confidence of beer -> diaper.
   Answer: 0.3 / None of the choices
9. Observe the table below and compute for the confidence of diaper => peanut.
   Answer: 0.43
10. Observe the table below and compute for the support of beer => peanut.
    Answer: 4/7
11. Observe the table below and compute for the support of SD Card => Phone case.
12. Observe the table below and compute for the lift ratio.
    Answer: 1.5
13. When it comes to association analysis, the more rules you produce, the greater the risk is. [True]
14. Suppose you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired. Which among the following is the ideal value for your alpha? [0.8]
15. Which of the following is not an advantage of using association rule? [Assumes transaction database is memory resident]
16. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. [True]
17. It measures the overall impact. [Support]
18. Clustering aims to discover certain features that often appear together in data. [False]
19. It is another type of association analysis that involves using sequence data. [Association Rule]
20. Observe the table below and compute for the support of SD Card => Phone case.

--------------------------------------------------------------------------

1. Segmentation is a data mining method that usually consists of two variables: a transaction and an item. [False]
2. It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. [Variable Clustering]
3. It controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}. [Lift ratio]
4. Association rule mining is about grouping similar samples into clusters. [False]
5. It is the process of discovering useful patterns and trends in large data sets. [data mining]
6. Observe the table below and compute for the lift ratio of egg -> peanut.
   Answer: 0.93
7. Observe the table below and compute for the confidence of airpods -> powerbank.
   Answer: 0.3
8. Observe the table below and compute for the support for Fries -> Burger.
   Answer: 0.6
9. It is the conditional probability of occurrence of the consequent given the antecedent. [Confidence]
10. Input validation helps to lessen what type of anomaly? [Insertion anomaly]
11. Observe the table below and compute for the confidence of Phone case -> SD card.
    Answer: 0.5
12. Lift ratio shows how effective the rule is in finding consequents. [True]
13. Observe the table below and compute for the lift ratio. [0.50]

--------------------------------------------------------------------------

15/20

1. Market Basket Analysis creates If-Then scenario rules. The IF part is called the _______. (Use lowercase for your answer) [ANTECEDENT]
2. It is the conditional probability of occurrence of the consequent given the antecedent. [CONFIDENCE]
3. It controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}. [LIFT RATIO]
4. Observe the table below and compute for the confidence of diaper -> egg. [0.50]
5. Observe the table below and compute for milk -> egg. [0.14]
6. Observe the table below and compute for the lift ratio of diaper -> milk. [0.40]
7. Observe the table below and compute for the confidence of beer -> diaper. [0.8]
8. Observe the table below and compute for the confidence of egg -> peanut. [1.5]
9. Write the confidence formula for the following expression: A -> B
   [(transactions containing both A and B) / (transactions containing A)]
10. Lift ratio shows how effective the rule is in finding consequents. [FALSE]
11. Observe the table below and compute for the confidence of airpods -> powerbank. [0.43]
12. Observe the table below and compute for the support for diaper -> peanut. [0.67]
13. Which of the following is not an application of a sequential pattern? [IDENTIFYING FAKE NEWS]
14. Lift ratio shows how effective the rule is in finding consequents. [TRUE]
15. Market Basket Analysis creates If-Then scenario rules. The THEN part is called the _______. (Use lowercase for your answer) [CONSEQUENT]
16. It is the conditional probability of occurrence of the consequent given the antecedent. [CONFIDENCE]
17. Observe the table below and compute for the lift ratio of powerbank -> airpods. [0.33]
18. It pertains to how likely item Y is purchased when item X is purchased, expressed as [X -> Y]. [confidence]
19. Which of the following is not an application of pattern discovery? [None of the choices]
20. Observe the table below and compute for the support for Phone case -> SD card. [0.75]

--------------------------------------------------------------------------

1. The objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future. [False]
2. These are sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in the data set. [Pattern]
3. When it comes to association analysis, the more rules you produce, the greater the risk is. [True]
4. Observe the table below and compute for the support for diaper => peanut. [0.43 (3/7 = 0.43 – repeated in no. 19)]

   TRANSACTION ID    ITEMS
   Transaction A     Beer, Peanut, Egg
   Transaction B     Beer, Milk, Peanut, Diaper
   Transaction C     Milk, Diaper, Egg
   Transaction D     Peanut, Egg, Diaper
   Transaction E     Beer, Peanut, Egg
   Transaction F     Egg, Beer
   Transaction G     Beer, Diaper, Peanut

5. It controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}. [Lift ratio]
6. It pertains to how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. [Support]
7. It shows how effective the rule is in finding consequents. [Lift ratio]
8. Observe the table in no. 4 and compute for the lift ratio of diaper -> milk. [0.3] [1.5; 0.29/(0.29 × 0.57) = 1.75]
9. Suppose you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired. Which among the following is the ideal value for your alpha? [0.8]
10. Association rule mining is about grouping similar samples into clusters. [False]
11. It is the conditional probability of occurrence of the consequent given the antecedent. [Confidence]
12. It is the conditional probability of occurrence of the consequent given the antecedent. [Confidence]
13. Observe the table below and compute for the confidence of phone case -> SD Card. [0.5 (0.3/0.6 = 0.5)]

   TRANSACTION ID   ITEMS
   1                airpods, charger, powerbank
   2                powerbank, phone case
   3                airpods, phone case, charger
   4                phone case, SD Card
   5                SD Card, charger, airpods
   6                SD Card, phone case, powerbank
   7                powerbank, phone case, SD Card
   8                airpods, powerbank
   9                phone case, airpods
   10               charger, SD Card, airpods

14. It pertains to how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. (Use lowercase for your answer) [support]
15. Observe the table in no. 4 and compute for the support for beer => peanut. [4/7]
16. Market Basket Analysis creates If-Then scenario rules. [True]
17. It shows how effective the rule is in finding consequents. [Lift ratio]
18. It is another type of association analysis that involves using sequence data. [Association rule]
19. Observe the table below and compute for the support for diaper => peanut. [0.43 (3/7 = 0.43 – just a repeat of no. 4)]

   TRANSACTION ID    ITEMS
   Transaction A     Beer, Peanut, Egg
   Transaction B     Beer, Milk, Peanut, Diaper
   Transaction C     Milk, Diaper, Egg
   Transaction D     Peanut, Egg, Diaper
   Transaction E     Beer, Peanut, Egg
   Transaction F     Egg, Beer, Peanut
   Transaction G     Beer, Diaper, Peanut

20. It shows how effective the rule is in finding consequents. [Lift ratio]
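Several of the formative answers above (support of beer => peanut = 4/7, support of diaper => peanut = 0.43, and the worked lift computation 0.29/(0.29 × 0.57) = 1.75 for diaper -> milk) can be reproduced with a short Python check against the 7-transaction table in which Trans F contains only Egg and Beer:

```python
# The 7-transaction table used in several formatives
# (the variant in which Trans F contains only Egg and Beer).
transactions = {
    "Trans A": {"beer", "peanut", "egg"},
    "Trans B": {"beer", "milk", "peanut", "diaper"},
    "Trans C": {"milk", "diaper", "egg"},
    "Trans D": {"peanut", "egg", "diaper"},
    "Trans E": {"beer", "peanut", "egg"},
    "Trans F": {"egg", "beer"},
    "Trans G": {"beer", "diaper", "peanut"},
}

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"beer", "peanut"}))    # 4/7, the answer key's 0.57
print(support({"diaper", "peanut"}))  # 3/7, the answer key's 0.43
print(lift({"diaper"}, {"milk"}))     # 0.5 / (2/7) = 1.75
```

Note that some quiz versions use a table where Trans F also contains Peanut; the numeric answers shift accordingly (e.g., support of beer => peanut becomes 5/7).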
[IT0089] MODULE 6 – DATA PREPARATION

Data Preparation
- Data preparation is the method of transforming raw data into a form suitable for modeling.
- It is also known as data preprocessing.

Why do we need to preprocess the data?

 MISSING VALUES
- Having null values in your data set could affect the accuracy of the model.
 OUTLIERS
- When your data has outliers, it could affect the distribution of your data.
 INCONSISTENT DATA
 IMPROPERLY FORMATTED DATA
 LIMITED FEATURES
 THE NEED FOR TECHNIQUES SUCH AS FEATURE ENGINEERING

Ways to Preprocess Data

DATA CLEANING
- Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.

DATA INTEGRATION AND TRANSFORMATION
- There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.

DATA REDUCTION
- Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Data Preprocessing Techniques

Variables Selection
- Observe minimizing garbage in, garbage out (GIGO).
- Procedures:
  1. Backward selection
  2. Forward selection

Variable Clustering

[M6-ST1] DATA ANOMALIES

What are anomalies?
- Anomalies are problems that can occur in poorly planned, unnormalized databases where all the data is stored in one table (a flat-file database).
- Anomalies are caused when there is too much redundancy in the database's information.

Database Anomalies
 Insertion Anomaly – happens when inserting vital data into the database is not possible because other data is not already there.
 Update Anomaly – happens when the person charged with keeping all the records current and accurate is asked, for example, to change an employee's title due to a promotion.
- If the data is stored redundantly in the same table and the person misses any of the copies, then there will be multiple titles associated with the employee. The end user has no way of knowing which title is correct.
 Deletion Anomaly – happens when the deletion of unwanted information causes desired information to be deleted as well.
- For example, if a single database record contains information about a particular product along with information about a salesperson for the company, and the salesperson quits, then information about the product is deleted along with the salesperson information.

HOW TO HANDLE OUTLIERS?

Use Graphical Methods to Identify Outliers

Handling Missing Data
- Missing data is a problem that continues to plague data analysis methods.
- Let's examine the cars data set.
- A common method of "handling" missing values is simply to omit the records or fields with missing values from the analysis.
- However, this action would make our data biased.

Common Criteria in Handling Null Values
1. Replace the missing value with some constant, specified by the analyst.
2. Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).
3. Replace the missing value with a value generated at random from the observed distribution of the variable.
4. Replace the missing value with an imputed value based on the other characteristics of the record.

Examples:
1) Replacing missing field values with user-defined constants.
2) Replacing missing field values with means or modes.
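The first three replacement criteria above can be sketched in plain Python. The cars data set itself is not reproduced in these notes, so the records below are made-up stand-ins:

```python
import random
import statistics

# Hypothetical "cars" fields; None marks a missing value.
mpg   = [21.0, None, 18.5, 30.1, None]
brand = ["toyota", "ford", None, "toyota", "honda"]

def fill_constant(values, constant):
    """Criterion 1: replace missing values with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]

def fill_mean(values):
    """Criterion 2 (numeric): replace missing values with the field mean."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def fill_mode(values):
    """Criterion 2 (categorical): replace missing values with the mode."""
    mode = statistics.mode(v for v in values if v is not None)
    return [mode if v is None else v for v in values]

def fill_random(values):
    """Criterion 3: draw replacements from the observed distribution."""
    observed = [v for v in values if v is not None]
    return [random.choice(observed) if v is None else v for v in values]

print(fill_mean(mpg))    # the two missing mpg values become 23.2
print(fill_mode(brand))  # the missing brand becomes "toyota"
```

Criterion 4 (model-based imputation from the record's other characteristics) needs a fitted model rather than a one-liner, which is why it is usually done with a dedicated library.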

HOW TO HANDLE CATEGORICAL DATA?

Dealing with Categorical Inputs
- When a categorical input has many levels, expanding the input into dummy variables can greatly increase the dimension of the input space.
- Including categorical inputs in the model can cause quasicomplete separation.

Solutions to the problems of categorical inputs include the following choices:
 Use the categorical input as a link to other data sets.
 Collapse the categories based on the number of observations in a category.
 Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.
 Use smoothed weight-of-evidence coding to convert the categorical input to a continuous input.

HOW TO HANDLE NOISY DATA?

Handling Noisy Data
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.

Anomaly Detection
- Anomaly detection (aka outlier analysis) is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset's normal behavior.
- Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, for instance a change in consumer behavior.

Observe the graph: the module's graph shows an anomalous drop detected in time series data. The anomaly is the yellow part of the line extending far below the blue shaded area, which is the normal range for this metric.

THREE MAIN CATEGORIES OF BUSINESS DATA ANOMALIES

1. Global Outliers
- Also known as point anomalies, these outliers exist far outside the entirety of a data set.

2. Contextual Outliers
- Also called conditional outliers, these anomalies have values that significantly deviate from the other data points that exist in the same context.

3. Collective Outliers
- When a subset of data points within a set is anomalous to the entire dataset, those values are called collective outliers. In this category, individual values aren't anomalous globally or contextually.

WHY DOES YOUR COMPANY NEED ANOMALY DETECTION?

Reasons to detect anomalies:
1. Anomaly detection for application performance
2. Anomaly detection for product quality
3. Anomaly detection for user experience

[M6-ST2] – VARIABLE SELECTION

Variable Selection
- Variable selection is intended to select the "best" subset of predictors.

Why Variable Selection?
 We want to explain the data in the simplest way – redundant predictors should be removed.
 Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
 Collinearity is caused by having too many variables trying to do the same job.
 Cost: if the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.

Prior to variable selection:
1. Identify outliers and influential points – maybe exclude them, at least temporarily.
2. Add in any transformations of the variables that seem appropriate.

FORWARD SELECTION
- The forward selection procedure starts with no variables in the model.

Steps of Forward Selection
1. For the first variable to enter the model, select the predictor most highly correlated with the target.
2. For each remaining variable, compute the sequential F-statistic for that variable, given the variables already in the model.
3. For the variable selected in step 2, test for the significance of the sequential F-statistic. If the resulting model is not significant, then stop, and report the current model without adding the variable from step 2. Otherwise, add the variable from step 2 into the model and return to step 2.

BACKWARD SELECTION
- This is the simplest of all variable selection procedures and can be easily implemented without special software.
- In situations where there is a complex hierarchy, backward elimination can be run manually while taking account of which variables are eligible for removal.

Steps of Backward Selection
1. Perform the regression on the full model; that is, using all available variables. For example, perhaps the full model has four variables: x1, x2, x3, x4.
2. For each variable in the current model, compute the partial F-statistic. In the first pass through the algorithm, these would be F(x1|x2, x3, x4), F(x2|x1, x3, x4), F(x3|x1, x2, x4), and F(x4|x1, x2, x3). Select the variable with the smallest partial F-statistic. Denote this value Fmin.
3. Test for the significance of Fmin. If Fmin is not significant, then remove the variable associated with Fmin from the model, and return to step 2. If Fmin is significant, then stop the algorithm and report the current model.

What if you have 2^k models to choose from?

Step 1: The analyst specifies how many (k) models of each size he or she would like reported, as well as the maximum number of predictors (p).

Step 3: Then all models of two predictors are built. Their R², R²adj, Mallows' Cp (see below), and s values are calculated. The best k models are reported.

 The procedure continues in this way until the maximum number of predictors (p) is reached. The analyst then has a listing of the best models of each size, 1, 2, ..., p, to assist in the selection of the best overall model.

EXAMPLE

STEPWISE PROCEDURE
- The stepwise procedure represents a modification of the forward selection procedure.
- A variable that has been entered into the model early in the forward selection process may turn out to be nonsignificant once other variables have been entered into the model.

FORMATIVES

Forward selection is the opposite of stepwise selection. [FALSE]

Estimation is about estimating the value for the
analyst wants in the model target variable except that the target variable is
Step 2 All models of one predictor are built. Their categorical rather than numeric.
R2, R2adj , Mallows’ Cp (see below), and s values are
 True
calculated. The best k models are reported, based on
these measures. Linearity is caused by having too many variables
trying to do the same job.
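The forward-selection loop described in the module can be sketched in a few lines of numpy. This is an illustration only, not code from the module: the function names, the SSE-based form of the sequential F-statistic, and the F-to-enter cutoff of 4.0 are assumptions of the sketch.

```python
import numpy as np

def sse(X, y):
    # Residual sum of squares of an OLS fit with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Forward selection driven by the sequential (partial) F-statistic."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        sse_cur = sse(X[:, selected], y)
        # Step 2: sequential F for each candidate, given the variables already in.
        # (On the first pass this is equivalent to picking the predictor most
        # highly correlated with the target, as step 1 prescribes.)
        best_f, best_j = -np.inf, None
        for j in remaining:
            sse_new = sse(X[:, selected + [j]], y)
            df_resid = n - len(selected) - 2  # candidate model: |selected|+1 slopes + intercept
            f_stat = (sse_cur - sse_new) / (sse_new / df_resid)
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        # Step 3: stop if even the best candidate is not significant.
        if best_f < f_to_enter:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

On simulated data where only some predictors drive the target, the sketch should keep the informative predictors and stop before admitting pure noise.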
 False (should be collinearity)

It is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
 Estimation

It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features.
 Feature extraction

Postcoding process is necessary for:
 Structure questions

A good rule of thumb in having a right amount of data is to have 10 records for every predictor value.
 True

Forward selection is the simplest variable selection model.
 False

Enumerate at least one of the two (2) types of variable transformation commonly used in machine learning:
 numeric variable and categorical variable

Data preparation affects:
 The quality of the research

It is a manipulation of scale values to ensure comparability with other scales:
 Scale transformation

It is a best practice to divide your dataset into train and test dataset.
 True

A review of the questionnaires is essential in order to:
 Increase accuracy and precision of the collected data

Feature selection maps the original feature space to a new feature space with lower dimensions by combining the original feature space.
 False (that describes feature extraction)

This happens when inserting vital data into the database is not possible because other data is not already there.
 Insertion anomaly

Unnecessary predictors will add noise to the estimation of other quantities that we are interested in.
 True

You can also use regression when handling noisy data.
 True

Given the following values for age (16, 27, -8990, 19, 15, 18), what is the problem with the data?
 Data inconsistency

Given are the following records for the attribute rating (1, 2, a, b, c, 3). What is the problem with the data?
 Data inconsistency

Supply the missing value for the attribute degree (SMBA, WMA, AGD, AGD, WMA, WMA, –, AGD, AGD).
 AGD

This happens when the deletion of unwanted information causes desired information to be deleted as well.
 Deletion anomaly

The following are techniques to treat missing values except:
 Returning to the field

The following can be done to treat unsatisfactory response except:
 Returning to the field

Clustering can also detect outliers.
 True

Supply the missing value given for the attribute city (Makati, Caloocan, Caloocan, Makati, Caloocan, –).
 Caloocan

These anomalies have values that significantly deviate from the other data points that exist in the same context.
 Contextual outliers

When there’s a missing value for a categorical variable, it is ideal to supply it by computing for the average of the data values available.
 False

Outlier analysis can provide good product quality.
 True

Anomaly detection can cause a bad user experience.
 False

You can use histogram to detect outliers.
 True

When a subset of data points within a set is anomalous to the entire data set, those values are:
 Collective outliers

These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database).
 Anomalies

Anomaly detection is also known as outlier analysis.
 True

It fits and performs variable selection on an ordinary least square regression predictive model.
 Linear regression selection

It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer)
 variable clustering

The simplest of all variable selection procedures is stepwise procedure.
 FALSE

Backward selection starts with all the variables.
 True

Prior to variable selection, one must identify outliers and influential points — maybe exclude them at least temporarily.
 True

Variable clustering is about grouping the attributes with similarities.
 True

Resampling refers to the process of sampling at random and with replacement from a data set.
 True

The figure below illustrates the first step in doing backward selection.
 False

These outliers exist far outside the entirety of a data set.
 Global outliers

These are also known as point anomalies.
 Global outliers
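Several items above classify outliers as global, contextual, or collective, and note that a histogram (or, equivalently, z-scores) can surface global outliers. A minimal sketch of a z-score screen follows; the function name and threshold are my own choices, not from the module. Note that very small samples cap how large |z| can get, so the cutoff may need to be lowered for short columns like the quiz examples.

```python
import numpy as np

def global_outliers(values, threshold=3.0):
    """Return the points whose |z-score| exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()  # population std is fine for a quick screen
    return x[np.abs(z) > threshold]
```

Applied to the quiz's age column with a lowered cutoff, `global_outliers([16, 27, -8990, 19, 15, 18], threshold=2.0)` flags -8990, the entry behind the "data inconsistency" answers.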
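The best-subsets procedure outlined earlier (report the k best models of each size up to p predictors) can be sketched by brute-force enumeration. This is an illustrative sketch only: it ranks models by R² alone, whereas the module also lists adjusted R², Mallows' Cp, and s, and the function names are mine.

```python
import numpy as np
from itertools import combinations

def r_squared(X, y):
    # R^2 of an OLS fit with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = float((y - y.mean()) @ (y - y.mean()))
    return 1.0 - float(resid @ resid) / tss

def best_subsets(X, y, k=1, max_p=None):
    """For each model size 1..max_p, return the k best predictor subsets by R^2."""
    n, p = X.shape
    max_p = max_p if max_p is not None else p
    report = {}
    for size in range(1, max_p + 1):
        scored = [(r_squared(X[:, list(combo)], y), combo)
                  for combo in combinations(range(p), size)]
        scored.sort(reverse=True)       # best R^2 first
        report[size] = scored[:k]       # the k best models of this size
    return report
```

With 2^k candidate models, this brute force is only viable for a small number of predictors, which is exactly the limitation the stepwise procedures are meant to sidestep.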
16/20

A homogenous data set is a data set whose data records have the same target value. True
It is a best practice to divide your dataset into train and test dataset. True
This happens when the deletion of unwanted information causes desired information to be deleted as well. Deletion anomaly
Supply for the missing values. [19.6]
Instead of using the real number for the age attribute, you categorized the age as the following: Young = 12–17, Adult = 18–34, Old = 35–60. What kind of data preparation was practiced? Data Cleaning
It is the process of integrating multiple databases, data cubes, or files. data integration
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database). Anomalies
It is a manipulation of scale values to ensure comparability with variables with other scales: Scale transformation
You can also use regression when handling noisy data. True
Supply the missing value in the given data below. [88.2]
The procedure starts with an empty set of features [reduced set]. Forward Selection
It is the simplest of all variable selection procedures and can be easily implemented without special software. (Use lowercase for your answer) backward selection
Enumerate at least one of the two (2) types of variables transformation commonly used in machine learning: (Use lowercase for your answer) categorical variables / numerical variables
The forward selection procedure starts with no variables in the model. True
Forward selection is the opposite of stepwise selection. False
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True
The figure below illustrates the basic steps for what type of variable selection method? Backward
The figure below illustrates the first step in doing backward selection. False (no figure shown)
It is intended to select the “best” subset of predictors. (Use lowercase for your answer) variable selection
Prior to variable selection, one must identify outliers and influential points — maybe exclude them at least temporarily. True

17/20

1. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
2. It is a manipulation of scale values to ensure comparability with variables with other scales: [SCALE TRANSFORMATION]
3. It is the process of integrating multiple databases, data cubes, or files [DATA INTEGRATION]
4. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employees. This is an example of what type of data anomaly? [UPDATE ANOMALY]
5. Input validation helps to lessen the deletion anomaly [FALSE]
6. Given the following values for age (16, 27, -8990, 19, 15, 18), what is the problem with the data? [DATA INCONSISTENCY]
7. Mode is used when catering missing values for numerical variables [FALSE]
8. A homogenous data set is a data set whose data records have the same target value [TRUE]
9. Postcoding process is necessary for: [STRUCTURED QUESTIONS]
10. Given the following records for the attribute rating (1, 2, A, B, C, 3). What is the problem with the data? [DATA INCONSISTENCY]
11. It identifies the set of input variables that jointly explains the maximum amount of data variance. The target variable is not considered with this method. [UNSUPERVISED SELECTION]
12. Clustering aims to discover certain features that often appear together in data [FALSE]
13. Backward selection starts with no variables [FALSE]
14. Forward selection is the simplest variable selection model [FALSE]
15. It fits and performs variable selection on an ordinary least square regression predictive model. [LINEAR REGRESSION SELECTION]
16. It identifies the set of input variables that jointly explain the maximum amount of variance contained in the target [UNSUPERVISED SELECTION]
17. The simplest of all variable selection procedures is stepwise procedure [FALSE]
18. It is intended to select the “best” subset of predictors (use lowercase for your answer) [variable selection]
19. Forward selection is the opposite of stepwise selection [FALSE]
20. It is the simplest of all variable selection procedures and can be easily implemented without special software (use lowercase for your answer) [backward selection]

20/20

1. You can also use regression when handling noisy data [TRUE]
2. When a subset of data points within a set is anomalous to the entire dataset, those values are: [COLLECTIVE OUTLIERS]
3. The following can be done to treat unsatisfactory response except: [ASSIGNING MISSING VALUES]
4. A homogenous data set is a data set whose data records have the same target value [TRUE]
5. Post coding process is necessary for: [STRUCTURED QUESTIONS]
6. Anomaly detection is also known as outlier analysis [TRUE]
7. Supply the missing value in the given data below: Exam_scores (100, 89, –, 90, 75, 87) [88.2]
8. When Stephen tried to change the section of all students enrolled to his class, however, upon performing the query, only one data record was modified instead of all the records. What data anomaly was present in Stephen’s database? [UPDATE ANOMALY]
9. In this category, individual values aren’t anomalous globally or contextually [COLLECTIVE OUTLIERS]
10. It is used when there is a single measurement of each element in the sample: [INTERDEPENDENCE TECHNIQUES]
11. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
12. Data preparation affects: [THE QUALITY OF THE RESEARCH]
13. It performs a greedy search to find the best performing feature subset. It iteratively creates models and determines the best or the worst performing feature at each iteration [RECURSIVE FEATURE ELIMINATION]
14. The first step in stepwise procedure is to select the predictor most highly correlated with the target [FALSE]
15. It involves both running the analysis to create unique clusters or segments and evaluating or describing the clusters that are created in the analysis. [CLUSTER ANALYSIS]
16. Give the first step for backward selection [Perform the regression on the full model]
17. A good rule of thumb in having a right amount of data is to have 10 records for every predictor value [TRUE]
18. Forward selection is the simplest variable selection model [FALSE]
19. The simplest of all variable selection procedures is stepwise procedure. [FALSE]
20. Clustering aims to discover certain features that often appear together in data [FALSE]

18/20

1. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employee. This is an example of what type of data anomaly? [UPDATE ANOMALY]
2. It is a best practice to divide your dataset into train and test dataset. [TRUE]
3. You can use histogram to detect outliers [TRUE]
4. Anomaly detection is also known as outlier analysis [TRUE]
5. The following can be done to treat unsatisfactory response except: [RETURNING TO THE FIELD]
6. Given the following values for age (16, 27, -8990, 19, 15, 18), what is the problem with the data? [DATA INCONSISTENCY]
7. Histogram is used to see missing data [FALSE]
8. Clustering can also detect outliers [TRUE]
9. These outliers exist far outside the entirety of a data set [GLOBAL OUTLIERS]
10. Identify at least one of the two principal reasons for eliminating a variable: (use lowercase for your answer) [redundancy or irrelevancy]
11. Variable clustering is about grouping the attributes with similarities [TRUE]
12. Supply the missing values given for the attribute salary (16000, 12000, 17500, 29000, –) [18,625]
13. It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features. [FEATURE EXTRACTION]
14. The figure below illustrates the basic steps for what type of variable selection method? [FORWARD SELECTION]
15. Forward selection is the simplest variable selection model [FALSE]
16. These are variables that significantly influence Y and so should be in the model but are excluded [OMITTED VARIABLES]
17. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
18. The first step in stepwise procedure is to select the predictor most highly correlated with the target. [FALSE]
19. Prior to variable selection, one must identify outliers and influential points — maybe exclude them at least temporarily [TRUE]
20. The procedure starts with an empty set of features [reduced set]. [FORWARD SELECTION]
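Several "supply the missing value" items above impute a numeric gap with the column mean (e.g. the salary answer 18,625 and the exam-score answer 88.2) and a categorical gap with the mode (e.g. Caloocan for city, AGD for degree). A small sketch of both conventions; the function names are my own, and `None` stands in for a missing cell.

```python
from collections import Counter

def impute_numeric(values):
    """Fill missing (None) entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def impute_categorical(values):
    """Fill missing (None) entries with the mode (most frequent category)."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

For example, `impute_numeric([16000, 12000, 17500, 29000, None])` fills the gap with 18625.0, matching the quiz answer, and the mode rule is why averaging is the wrong treatment for a missing categorical value.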
16/20

1. Given the following values for age (16, 27, -8990, 19, 15, 18), what is the problem with the data? [Data Inconsistency]
2. The following can be done to treat unsatisfactory response except: [RETURNING TO THE FIELD]
3. Supply the missing value for the attribute degree (SMBA, WMA, AGD, AGD, WMA, WMA, –, AGD, AGD). [AGD]
4. The following are techniques to treat missing values except: [RETURNING TO THE FIELD]
5. Supply for the missing values. [19.6]
6. Enumerate at least one main category of business data anomalies (use lowercase for your answer) [GLOBAL OUTLIERS]
7. It is the process of integrating multiple databases, data cubes, or files [DATA INTEGRATION]
8. It is a manipulation of scale values to ensure comparability with variables with other scales: [SCALE TRANSFORMATION]
9. Resampling refers to the process of sampling at random and with replacement from a data set [TRUE]
10. Identify at least one of the two principal reasons for eliminating a variable: (use lowercase for your answer) [REDUNDANCY]
11. It identifies the set of input variables that jointly explain the maximum amount of variance contained in the target. [UNSUPERVISED LEARNING]
12. It involves both running the analysis to create unique clusters or segments and evaluating or describing the clusters that are created in the analysis [CLUSTER ANALYSIS]
13. Variable clustering is about grouping the attributes with similarities [TRUE]
14. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
15. When a subset of data points within a set is anomalous to the entire dataset, those values are: [COLLECTIVE OUTLIERS]
16. It is a review of the questionnaires in order to increase accuracy and precision of the collected data: [EDITING]
17. Forward selection is the opposite of stepwise selection [FALSE]
18. It is the simplest of all variable selection procedures and can be easily implemented without special software (use lowercase for your answer) [BACKWARD SELECTION]
19. It starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set [STEPWISE FORWARD AND BACKWARD SELECTION] or [STEPWISE BACKWARD ELIMINATION]
20. Backward selection starts with all the variables [TRUE]
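The backward-elimination answers above (start with all the variables; the first step is to regress on the full model) can be sketched with numpy as follows. This is an illustrative sketch only, not code from the module: the helper names and the F-to-remove cutoff are my own choices.

```python
import numpy as np

def sse(X, y):
    # Residual sum of squares of an OLS fit with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_select(X, y, f_to_remove=4.0):
    """Backward elimination: fit the full model, then repeatedly drop the
    variable with the smallest partial F while it is not significant."""
    n, k = X.shape
    selected = list(range(k))  # step 1: start from the full model
    while selected:
        sse_full = sse(X[:, selected], y)
        df_resid = n - len(selected) - 1  # slopes plus intercept
        # Step 2: partial F for each variable, given all the others.
        f_min, worst = np.inf, None
        for j in selected:
            others = [c for c in selected if c != j]
            f_stat = (sse(X[:, others], y) - sse_full) / (sse_full / df_resid)
            if f_stat < f_min:
                f_min, worst = f_stat, j
        # Step 3: stop once the smallest partial F (Fmin) is significant.
        if f_min >= f_to_remove:
            break
        selected.remove(worst)
    return selected
```

On simulated data the sketch should retain the predictors that genuinely drive the target while pruning noise columns.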
Perform the regression on the full model; that is, using all available variables. For example, perhaps the full model has four variables, x1, x2, x3, x4. - Backward Selection

The figure below illustrates the first step in doing backward selection. False

These outliers exist far outside the entirety of a data set. Global outliers

These are also known as point anomalies. Global outliers

Forward selection is the opposite of stepwise selection. False

Anomaly detection can cause a bad user experience. True

Anomaly detection is also known as outlier analysis. True

It is a best practice to divide your dataset into train and test dataset. True

It fits and performs variable selection on an ordinary least square regression predictive model. Linear regression selection

It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer) variable clustering

The simplest of all variable selection procedures is stepwise procedure. FALSE

Backward selection starts with all the variables. True

It is about estimating the value for the target variable except that the target variable is categorical rather than numeric. Estimation

Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True

Variable clustering is about grouping the attributes with similarities. True

Resampling refers to the process of sampling at random and with replacement from a data set. True

Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True