
Data Mining Questions and Answers

Q1) What is data mining? In your answer, address the following:

a. Is it another hype?
b. Is it a simple transformation of technology developed from databases,
statistics, and machine learning?
c. Describe the steps involved in data mining when viewed as a process
of knowledge discovery.
d. Explain how the evolution of database technology led to data mining.

Data mining refers to the process or method that extracts or “mines” interesting
knowledge or patterns from large amounts of data.

(a) Is it another hype?

Data mining is not another hype. Instead, the need for data mining has
arisen due to the wide availability of huge amounts of data and the imminent need
for turning such data into useful information and knowledge. Thus, data mining
can be viewed as the result of the natural evolution of information technology.

(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?

No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
(c) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.

The steps involved in data mining, when viewed as a process of knowledge discovery, are as follows (a brief illustrative code sketch follows the list):

• Data cleaning, a process that removes or transforms noise and inconsistent data

• Data integration, where multiple data sources may be combined

• Data selection, where data relevant to the analysis task are retrieved from the database

• Data transformation, where data are transformed or consolidated into forms appropriate for mining

• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns

• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures

• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
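As a rough illustration only, the steps above can be strung together in a short script. This is a hypothetical sketch using pandas and scikit-learn; the table contents, column names, and the choice of clustering as the mining step are assumptions for demonstration, not part of the definition.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Data integration: combine multiple data sources (tiny made-up tables here).
sales = pd.DataFrame({'customer_id': [1, 2, 3, 4],
                      'purchase_amount': [120.0, 80.0, None, 300.0]})
customers = pd.DataFrame({'customer_id': [1, 2, 3, 4],
                          'age': [25, 41, 35, 52],
                          'income': [30000, 52000, 41000, 75000]})
data = sales.merge(customers, on='customer_id')

# Data cleaning: remove noise and inconsistent data (here: drop incomplete rows).
data = data.dropna()

# Data selection: retrieve only the attributes relevant to the analysis task.
selected = data[['age', 'income', 'purchase_amount']]

# Data transformation: consolidate the data into a form appropriate for mining.
scaled = StandardScaler().fit_transform(selected)

# Data mining: apply a method to extract patterns (clustering, as an example).
cluster_labels = KMeans(n_clusters=2, n_init=10).fit_predict(scaled)

# Pattern evaluation / knowledge presentation: summarize the discovered groups.
print(selected.assign(cluster=cluster_labels).groupby('cluster').mean())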

(d) Explain how the evolution of database technology led to data mining.

Database technology began with the development of data collection and database
creation mechanisms that led to the development of effective mechanisms for data
management including data storage and retrieval, and query and transaction
processing. The large number of database systems offering query and transaction
processing eventually and naturally led to the need for data analysis and
understanding. Hence, data mining began its development out of this necessity.
Q2) In real-world data, tuples with missing values for some attributes are a
common occurrence. Describe various methods for handling this problem.

The various methods for handling the problem of missing values in data tuples
include the following (a brief code sketch of some of these methods follows the list):
1. Ignoring the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.

2. Manually filling in the missing value: In general, this approach is time-consuming and
may not be a reasonable task for large data sets with many missing values, especially
when the value to be filled in is not easily determined.

3. Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that they
form an interesting concept, since they all have a value in common — that of
“Unknown.” Hence, although this method is simple, it is not recommended.

4. Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values: For example, suppose that the average income of
AllElectronics customers is $28,000. Use this value to replace any missing values for
income.

5. Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as that
of the given tuple.

6. Using the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
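To make methods 3 to 5 concrete, here is a brief, hypothetical sketch using pandas; the DataFrame contents and column names are made-up example values, not from the text.

import pandas as pd

df = pd.DataFrame({
    'income':      [28000, None, 35000, None, 41000],
    'occupation':  ['clerk', 'manager', None, 'clerk', 'clerk'],
    'credit_risk': ['low', 'high', 'low', 'high', 'low'],
})

# Method 3: fill with a global constant (simple, but not recommended).
df_const = df.fillna({'income': -1, 'occupation': 'Unknown'})

# Method 4: attribute mean for numeric values, attribute mode for categorical values.
df_mean = df.copy()
df_mean['income'] = df_mean['income'].fillna(df_mean['income'].mean())
df_mean['occupation'] = df_mean['occupation'].fillna(df_mean['occupation'].mode()[0])

# Method 5: attribute mean per class (here, per credit_risk category).
df_class = df.copy()
df_class['income'] = df_class.groupby('credit_risk')['income'] \
                             .transform(lambda s: s.fillna(s.mean()))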
Q3) A data warehouse can be modeled by either a star schema or a
snowflake schema. Briefly describe the difference between the two
models.

Star schema, in which the data warehouse contains:

(1) a large central table (fact table) containing the bulk of the data, with no
redundancy.

(2) a set of smaller attendant tables (dimension tables), one for each
dimension. The schema graph resembles a starburst, with the dimension
tables displayed in a radial pattern around the central fact table.

Snowflake schema, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.

The major difference between the snowflake and star schema models is that
the dimension tables of the snowflake model may be kept in normalized
form to reduce redundancies.

They are similar in the sense that they both have a fact table, as well as some dimension tables.

The major difference is that some dimension tables in the snowflake schema
are normalized, thereby further splitting the data into additional tables.

The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space.

The snowflake schema reduces some redundancy by sharing common tables: the tables are easy to maintain and save some space. However, it is less efficient, and the saving of space is negligible in comparison with the typical magnitude of the fact table.

Therefore, empirically, the star schema is often preferred, simply because nowadays efficiency has higher priority than space, provided the space requirement is not too huge. Sometimes in industry, to speed up processing, people denormalize data from a snowflake schema into a star schema. Another option is that some practitioners use a snowflake schema to maintain the dimensions, and then present users with the same data collapsed into a star schema.


Q4) Describe the decision tree Algorithm and Naïve Bayes Algorithm?

Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

Figure 8.2 illustrates a decision tree for the concept buys_computer, indicating whether an AllElectronics customer is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

- The tree is constructed in a top-down, recursive, divide-and-conquer manner.

- It starts with a training set of tuples and their associated class labels.

- The training set is recursively partitioned into smaller subsets as the tree is being built.

The algorithm is called with three parameters:

1- the data partition (referred to as D)

2- attribute_list

3- Attribute_selection_method

Note: Decision trees are commonly used for gaining information for the purpose of decision making.

The basic decision tree algorithm is known as ID3 (Iterative Dichotomiser).

Algorithm:

Generate a decision tree from the training tuples of data partition, D.

Input:

Data partition : D, which is a set of training tuples and their associated class labels.

attribute_list : the set of candidate attributes.

Attribute_selection_method: a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split-point or a splitting subset.

Output:

A decision tree.
Method:

(1) create a node N;

(2) if tuples in D are all of the same class, C, then

(3) return N as a leaf node labeled with the class C;

(4) if attribute_list is empty then

(5) return N as a leaf node labeled with the majority class in D; // majority voting

(6) apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;

(7) label node N with splitting_criterion;

(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees

(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute

(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition

(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition

(12) if Dj is empty then

(13) attach a leaf labeled with the majority class in D to node N;

(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;

endfor

(15) return N;
Explanation of the algorithm:

If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class (steps 2 and 3).

Note that steps 4 and 5 are terminating conditions.

Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion, which:

- tells us which attribute to test at node N by determining the “best” way to partition the tuples in D into individual classes (step 6);

- tells us which branches to grow from node N with respect to the outcomes of the chosen test;

- indicates the splitting attribute and may also indicate either a split-point or a splitting subset.

The splitting criterion is chosen so that the resulting partitions at each branch are as “pure” as possible.

- A partition is pure if all the tuples in it belong to the same class.

The node N is labeled with the splitting criterion, which serves as a test at the node
(step 7). A branch is grown from node N for each of the outcomes of the splitting
criterion. The tuples in D are partitioned accordingly (steps 10 to 11). There are three possibilities for partitioning tuples based on the splitting criterion, as described below.

Let A be the splitting attribute.

1. A is discrete-valued: one branch is grown for each known value of A.

2. A is continuous-valued: two branches are grown, corresponding to A ≤ split_point and A > split_point.

3. A is discrete-valued and a binary tree must be produced: the test is of the form A ∈ S_A, where S_A is the splitting subset for A.

The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D (step 14).

The recursive partitioning stops only when any one of the following terminating
conditions is true:

• All the tuples in partition D belong to the same class (steps 2 and 3).

• There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case, majority voting is employed for classifying the leaf (step 5).

• There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is created with the majority class in D (step 13).

The resulting decision tree is returned (step 15).

The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D. That is, the computational cost of growing a tree grows at most as n × |D| × log(|D|).
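For illustration, a compact ID3-style sketch of the Generate_decision_tree procedure above is given below in Python. The data layout (a list of dicts with a 'class' key), the helper names, and the use of information gain as the Attribute_selection_method are assumptions for the sketch; it handles discrete-valued attributes with multiway splits only.

import math
from collections import Counter

def entropy(tuples):
    counts = Counter(t['class'] for t in tuples)
    n = len(tuples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(tuples, attr):
    n = len(tuples)
    remainder = 0.0
    for value in set(t[attr] for t in tuples):
        subset = [t for t in tuples if t[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(tuples) - remainder

def generate_decision_tree(tuples, attribute_list):
    classes = [t['class'] for t in tuples]
    if len(set(classes)) == 1:                      # steps 2-3: pure partition becomes a leaf
        return classes[0]
    if not attribute_list:                          # steps 4-5: majority voting
        return Counter(classes).most_common(1)[0][0]
    attr = max(attribute_list, key=lambda a: information_gain(tuples, a))   # step 6
    tree = {attr: {}}                               # step 7: label the node with the splitting criterion
    remaining = [a for a in attribute_list if a != attr]                    # step 9
    for value in set(t[attr] for t in tuples):      # steps 10-14: grow one subtree per outcome
        subset = [t for t in tuples if t[attr] == value]
        tree[attr][value] = generate_decision_tree(subset, remaining)
    return tree                                     # step 15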

Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels, and each tuple
is represented by an n-D attribute vector X = (x1, x2, …, xn) .

2. Suppose there are m classes C1, C2, …, Cm.

Given a tuple, X, the classifier will predict that X belongs to the class having the
highest posterior probability, conditioned on X.

The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if

P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i

Thus, we maximize P(Ci | X); the class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

3. Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized.

-To maximize P(X|𝐶𝑖 )P(𝐶𝑖 ), we need to know class prior probabilities

-If the probabilities are not known, assume that

P(C1)=P(C2)=…=P(Cm) ⇒ maximize P(X|𝐶𝑖 )

- Class prior probabilities can be estimated by P(𝐶𝑖 )=|𝐶𝑖 ,D|/|D|

where |𝐶𝑖 ,D| is the number of training tuples of class 𝐶𝑖 in D.

4. Assume Class Conditional Independence to reduce the computational cost of P(X | Ci).

- Given X = (x1, …, xn), P(X | Ci) is:

P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

- The probabilities P(x1 | Ci), …, P(xn | Ci) can be estimated from the training tuples.

To compute P(X |𝐶𝑖 ) , we consider the following:

(a) If 𝐴𝑘 is categorical, P(𝑋𝑘 |𝐶𝑖 ) is the number of tuples in 𝐶𝑖 having value 𝑋𝑘 for
𝐴𝑘 divided by |Ci, D| (number of tuples of 𝐶𝑖 in D)

(b) If Ak is continuous-valued, P(Xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) · σ)) · exp( −(x − μ)² / (2σ²) )

so that P(Xk | Ci) = g(xk, μCi, σCi).

5. To predict the class label of X, P(X | Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is Ci if and only if

P(X | Ci) P(Ci) > P(X | Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i
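The steps above can be illustrated with a small, hypothetical sketch of a naïve Bayesian classifier for categorical attributes in Python. The data layout and function names are assumptions; the continuous (Gaussian) case and zero-count smoothing are omitted for brevity.

from collections import Counter, defaultdict

def train_naive_bayes(tuples, attributes):
    # tuples: list of dicts, each holding the given attributes plus a 'class' key.
    n = len(tuples)
    class_counts = Counter(t['class'] for t in tuples)
    priors = {c: cnt / n for c, cnt in class_counts.items()}   # P(Ci) = |Ci,D| / |D|
    cond = defaultdict(Counter)                                 # counts for estimating P(xk | Ci)
    for t in tuples:
        for a in attributes:
            cond[(t['class'], a)][t[a]] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts, attributes):
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for a in attributes:
            # P(xk | Ci): fraction of class-Ci training tuples having this attribute value.
            score *= cond[(c, a)][x[a]] / class_counts[c]
        if score > best_score:          # keep the class with the maximum P(X | Ci) P(Ci)
            best_class, best_score = c, score
    return best_class

For example, after training on tuples such as {'age': 'youth', 'income': 'high', 'class': 'no'}, calling predict on a new tuple returns the class with the largest P(X | Ci) P(Ci), which is the maximum posteriori hypothesis described above.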


Q5) Suppose that your local bank has a data mining system. The bank has
been studying your debit card usage patterns. Noticing that you make many
transactions at home renovation stores, the bank decides to contact you,
offering information regarding their special loans for home improvements.

a) Discuss how this may conflict with your right to privacy.

b) Describe another situation in which you feel that data mining can infringe on your privacy.

a) The bank did not ask my permission to analyze my transactions in this way; profiling my purchases and contacting me with targeted offers can be considered an intrusion into my privacy and private life.

b) Social media and e-commerce sites, which mine users’ browsing and purchase histories to target advertising and recommendations, are other situations in which data mining can infringe on privacy.


Q6) What is the difference between Agglomerative and Divisive Hierarchical
Clustering? Illustrate with an example.

A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.

An agglomerative hierarchical clustering method uses a bottom-up strategy. It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied. The single cluster becomes the hierarchy’s root. For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster. Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations.

A divisive hierarchical clustering method employs a top-down strategy. It starts by placing all objects in one cluster, which is the hierarchy’s root. It then divides the root cluster into several smaller subclusters, and recursively partitions those clusters into smaller ones. The partitioning process continues until each cluster at the lowest level is coherent enough: it contains only one object, or the objects within the cluster are sufficiently similar to each other.
Example

Agglomerative versus divisive hierarchical clustering: Figure 10.6 shows


the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, on a data set of five objects, {a, b, c, d, e}. Initially, AGNES, the agglomerative method, places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
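As a rough illustration (not the textbook’s AGNES or DIANA implementations), the following Python sketch performs agglomerative clustering of five hypothetical 2-D points with SciPy, merging the closest clusters step by step under the minimum Euclidean distance criterion described above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates standing in for objects a, b, c, d, e.
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 9.0]])
labels = ['a', 'b', 'c', 'd', 'e']

# 'single' linkage merges the two clusters whose closest members have the
# minimum Euclidean distance, as in the AGNES example above.
Z = linkage(points, method='single', metric='euclidean')
print(Z)   # each row records one merge step: (cluster i, cluster j, distance, new cluster size)

# Cutting the hierarchy into two clusters gives a flat clustering.
assignments = fcluster(Z, t=2, criterion='maxclust')
print(dict(zip(labels, assignments)))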
Q7) What is the difference between OLTP and OLAP?

Online transaction processing (OLTP) systems perform online transaction and query processing. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.

Online analytical processing (OLAP) systems are data warehouse systems: they serve users or knowledge workers in the role of data analysis and decision making, and they can organize and present data in various formats in order to accommodate the diverse needs of different users.
A comparison of OLTP and OLAP systems, feature by feature:

Users and system orientation
• OLTP: customer-oriented; used for transaction and query processing by clerks, clients, and information technology professionals.
• OLAP: market-oriented; used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents
• OLTP: manages current data that, typically, are too detailed to be easily used for decision making.
• OLAP: manages large amounts of historic data; provides facilities for summarization and aggregation; stores and manages information at different levels of granularity.

Database design
• OLTP: adopts an entity-relationship (ER) data model and an application-oriented database design.
• OLAP: adopts either a star or a snowflake model and a subject-oriented database design.

View
• OLTP: focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• OLAP: spans multiple versions of a database schema, due to the evolutionary process of an organization; deals with information that originates from different organizations, integrating information from many data stores; data are stored on multiple storage media.

Access patterns
• OLTP: consists mainly of short, atomic transactions; such a system requires concurrency control and recovery mechanisms.
• OLAP: accesses are mostly read-only operations (because most data warehouses store historic rather than up-to-date information), although many could be complex queries.
Q8) Differentiate between Data Mining and Data warehousing?

Data mining is the process of finding patterns in a given data set. These patterns
can often provide meaningful and insightful data to whoever is interested in that
data. Data mining is used today in a wide variety of contexts – in fraud detection,
as an aid in marketing campaigns, and even supermarkets use it to study their
consumers.

Data warehousing can be said to be the process of centralizing or aggregating data from multiple sources into one common repository.

Example of data mining

If you’ve ever used a credit card, then you may know that credit card companies
will alert you when they think that your credit card is being fraudulently used by
someone other than you. This is a perfect example of data mining – credit card
companies have a history of your purchases from the past and know geographically
where those purchases have been made. If all of a sudden some purchases are made
in a city far from where you live, the credit card companies are put on alert to a
possible fraud since their data mining shows that you don’t normally make
purchases in that city. Then, the credit card company can disable your card for that
transaction or just put a flag on your card for suspicious activity.
Example of data warehousing

A great example of data warehousing that everyone can relate to is what Facebook does. Facebook basically gathers all of your data – your friends, your likes, who you stalk, etc. – and then stores that data into one central repository.
Even though Facebook most likely stores your friends, your likes, etc, in separate
databases, they do want to take the most relevant and important information and
put it into one central aggregated database. Why would they want to do this? For
many reasons – they want to make sure that you see the most relevant ads that
you’re most likely to click on, they want to make sure that the friends that they
suggest are the most relevant to you, etc – keep in mind that this is the data mining
phase, in which meaningful data and patterns are extracted from the aggregated
data. But, underlying all these motives is the main motive: to make more money –
after all, Facebook is a business.

Remember that data warehousing is a process that must occur before any data
mining can take place. In other words, data warehousing is the process of
compiling and organizing data into one common database, and data mining is the
process of extracting meaningful data from that database. The data mining process
relies on the data compiled in the data warehousing phase in order to detect
meaningful patterns.

Data mining is the process of extracting data from large data sets.

Data warehousing is the process of pooling all relevant data together.

Both data mining and data warehousing are business intelligence collection tools.

Data mining is specific in data collection.

Data warehousing is a tool to save time and improve efficiency by bringing together data from different locations and different areas of the organization.

Data warehouse has three layers, namely staging, integration and access.
Q9) Explain Association algorithm in Data mining?
Q10) Suppose that the data mining task is to cluster points (with (x, y)
representing location) into three clusters, where the points are

A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the centers of each cluster, respectively. Use the k-means algorithm to show only (a) the three cluster centers after the first round of execution and (b) the final three clusters.

(a) After the first round, the three new clusters are:
(1) {A1}, (2) {B1, A3, B2, B3, C2}, (3) {C1, A2},
and their centers are
(1) (2, 10), (2) (6, 6), (3) (1.5, 3.5).

(b) The final three clusters are:

(1) {A1, C2, B1}, (2) {A3, B2, B3}, (3) {C1, A2}.
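For reference, a short Python sketch of the k-means computation for this exercise is given below. The variable names are illustrative; the loop simply repeats the assignment and update steps with Euclidean distance until the centers stop changing, and it does not handle empty clusters (which do not arise for this data set). The first assignment/update round reproduces the centers reported in part (a).

import math

# The eight points from the exercise and the three given initial centers.
points = {'A1': (2, 10), 'A2': (2, 5), 'A3': (8, 4), 'B1': (5, 8),
          'B2': (7, 5), 'B3': (6, 4), 'C1': (1, 2), 'C2': (4, 9)}
centers = [points['A1'], points['B1'], points['C1']]

def dist(p, q):
    # Euclidean distance between two 2-D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

while True:
    # Assignment step: each point joins the cluster with the nearest center.
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters[nearest].append(name)
    # Update step: each center becomes the mean of its assigned points.
    new_centers = [(sum(points[n][0] for n in c) / len(c),
                    sum(points[n][1] for n in c) / len(c)) for c in clusters]
    if new_centers == centers:   # stop when the centers no longer change
        break
    centers = new_centers

print(clusters)   # final clusters
print(centers)    # final cluster centers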


Q11) How is a data warehouse different from a database? How are they
similar?

Answer:

• Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support; whereas a database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases where the schema of one database may not agree with the schema of another. A database system supports ad-hoc query and on-line transaction processing.

• Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.
