Q1) a. Is data mining another hype?
b. Is it a simple transformation of technology developed from databases,
statistics, and machine learning?
c. Describe the steps involved in data mining when viewed as a process
of knowledge discovery.
d. Explain how the evolution of database technology led to data mining.
Data mining refers to the process or method that extracts or “mines” interesting
knowledge or patterns from large amounts of data.
Data mining is not another hype. Instead, the need for data mining has
arisen due to the wide availability of huge amounts of data and the imminent need
for turning such data into useful information and knowledge. Thus, data mining
can be viewed as the result of the natural evolution of information technology.
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, which identifies the truly interesting patterns representing knowledge
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
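The steps above can be sketched on toy data; the records, field names, and the trivial "pattern" (an average) are invented for illustration:

```python
# A minimal sketch of the knowledge-discovery steps above on toy data.
# The records, field names, and "pattern" criterion are invented for illustration.

records = [
    {"age": 25, "income": 30000},
    {"age": -1, "income": 45000},   # noisy: impossible age
    {"age": 40, "income": None},    # inconsistent: missing income
    {"age": 35, "income": 52000},
    {"age": 52, "income": 48000},
]

# Data cleaning: remove noisy and inconsistent records.
cleaned = [r for r in records if r["age"] > 0 and r["income"] is not None]

# Data selection: retrieve only the attribute relevant to the analysis task.
incomes = [r["income"] for r in cleaned]

# Data mining: apply a (trivial) method to extract a pattern.
average_income = sum(incomes) / len(incomes)
print(f"Average income of valid records: {average_income:.0f}")
```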
(d) Explain how the evolution of database technology led to data mining.
Database technology began with the development of data collection and database
creation mechanisms that led to the development of effective mechanisms for data
management including data storage and retrieval, and query and transaction
processing. The large number of database systems offering query and transaction
processing eventually and naturally led to the need for data analysis and
understanding. Hence, data mining began its development out of this necessity.
Q2) In real-world data, tuples with missing values for some attributes are a
common occurrence. Describe various methods for handling this problem.
The various methods for handling the problem of missing values in data tuples
include:
1. Ignoring the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.
2. Manually filling in the missing value: In general, this approach is time-consuming and
may not be a reasonable task for large data sets with many missing values, especially
when the value to be filled in is not easily determined.
3. Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that they
form an interesting concept, since they all have a value in common — that of
“Unknown.” Hence, although this method is simple, it is not recommended.
4. Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values: For example, suppose that the average income of
AllElectronics customers is $28,000. Use this value to replace any missing values for
income.
5. Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as that
of the given tuple.
6. Using the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
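Several of the methods above (the global constant, the attribute mean, and the class-conditional mean) can be sketched in a few lines; the customer tuples and values are invented:

```python
from statistics import mean

# Hypothetical customer tuples: (income, credit_risk); None marks a missing income.
tuples = [
    (28000, "low"), (None, "low"), (36000, "low"),
    (15000, "high"), (None, "high"), (21000, "high"),
]

# Method 3: fill with a global constant (shown for contrast; not recommended).
filled_constant = [(inc if inc is not None else "Unknown", c) for inc, c in tuples]

# Method 4: fill with the overall attribute mean.
overall_mean = mean(inc for inc, _ in tuples if inc is not None)
filled_mean = [(inc if inc is not None else overall_mean, c) for inc, c in tuples]

# Method 5: fill with the mean of tuples in the same class.
class_means = {
    cls: mean(inc for inc, c in tuples if c == cls and inc is not None)
    for cls in {c for _, c in tuples}
}
filled_class_mean = [
    (inc if inc is not None else class_means[c], c) for inc, c in tuples
]

print(filled_class_mean)
```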
Q3) A data warehouse can be modeled by either a star schema or a
snowflake schema. Briefly describe the difference between the two
models.
A star schema consists of (1) a large central table (fact table) containing the
bulk of the data, with no redundancy, and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a
starburst, with the dimension tables displayed in a radial pattern around the
central fact table.
The two models are similar in that both have a fact table as well as a set of
dimension tables. The major difference is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies, thereby
further splitting the data into additional tables. The advantage of the star
schema is its simplicity, which enables efficient querying, but it requires more
storage space.
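The contrast can be sketched with invented table and column names; a star-schema query joins the fact table out to each dimension, while a snowflake schema would split a dimension into further normalized tables:

```python
# A toy star-schema layout. Table and column names are invented for illustration:
# one fact table plus denormalized dimension tables keyed by surrogate IDs.

time_dim = {1: {"day": "2024-01-15", "quarter": "Q1", "year": 2024}}
item_dim = {10: {"item_name": "laptop", "brand": "Acme", "type": "computer"}}

sales_fact = [
    {"time_key": 1, "item_key": 10, "units_sold": 5, "dollars_sold": 4500.0},
]

# A star-schema query is a simple join from the fact table out to each dimension.
for row in sales_fact:
    t, i = time_dim[row["time_key"]], item_dim[row["item_key"]]
    print(t["quarter"], i["brand"], row["dollars_sold"])

# In a snowflake schema, a dimension is normalized into further tables: e.g. the
# item dimension would hold only a supplier_key referencing a separate supplier table.
```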
Figure 8.2 illustrates a decision tree for the concept buys_computer, indicating
whether an AllElectronics customer is likely to purchase a computer. Each
internal (non-leaf) node represents a test on an attribute. Each leaf node
represents a class (either buys_computer = yes or buys_computer = no).
The algorithm takes three inputs:
1- A training set of tuples and their associated class labels (the data partition D).
2- attribute_list, the set of candidate attributes.
3- Attribute_selection_method, a procedure for determining the best splitting criterion.
Note (Decision trees are commonly used for gaining information for the
purpose of decision-making.)
Algorithm: Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion.
Output:
A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
(7) label node N with the splitting criterion;
(8) if the splitting attribute is discrete-valued and multiway splits are allowed then
(9) attribute_list ← attribute_list − splitting attribute;
(10) for each outcome j of the splitting criterion
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
(15) return N;
Explanation of the algorithm:
If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class (steps 2 and 3).
The node N is labeled with the splitting criterion, which serves as a test at the node
(step 7). A branch is grown from node N for each of the outcomes of the splitting
criterion. The tuples in D are partitioned accordingly (steps 10 to 11). There are
three possibilities for partitioning tuples based on the splitting criterion, each with
examples.
1. A is discrete-valued: a branch is created for each known value of A, and Dj
is the subset of tuples in D having that value of A.
2. A is continuous-valued: two branches are grown, corresponding to the
outcomes A ≤ split_point and A > split_point.
3. A is discrete-valued and a binary tree must be produced: the test at the node
is of the form "A ∈ SA?", where SA is the splitting subset for A, and two
branches are grown.
The algorithm uses the same process recursively to form a decision tree for the
tuples at each resulting partition, Dj, of D (step 14).
The recursive partitioning stops only when any one of the following terminating
conditions is true:
There are no remaining attributes on which the tuples may be further partitioned
(step 4). In this case, majority voting is employed for classifying the leaf (step 5).
There are no tuples for a given branch, that is, a partition 𝐷_𝑗 is empty (step 12). In
this case, a leaf is created with the majority class in D
(step 13).
The computational cost of growing a tree is at most
O(n × |D| × log(|D|)),
where n is the number of attributes describing the tuples in D and |D| is the
number of training tuples in D.
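The induction procedure above can be sketched as an ID3-style routine, using information gain as the Attribute_selection_method; the toy training set and attribute names are invented:

```python
from collections import Counter
from math import log2

# A compact sketch of the decision-tree induction above (ID3-style, with
# information gain as Attribute_selection_method). Data names are invented.

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def generate_decision_tree(D, attribute_list):
    labels = [label for _, label in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # steps 2-3: all tuples in one class
        return labels[0]
    if not attribute_list:               # steps 4-5: no attributes left, majority vote
        return majority
    def gain(a):                         # step 6: information gain of attribute a
        parts = {}
        for x, y in D:
            parts.setdefault(x[a], []).append(y)
        return entropy(labels) - sum(
            len(p) / len(D) * entropy(p) for p in parts.values())
    best = max(attribute_list, key=gain)
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attribute_list if a != best]
    for v in {x[best] for x, _ in D}:    # steps 10-14: one branch per outcome
        Dj = [(x, y) for x, y in D if x[best] == v]
        tree["branches"][v] = generate_decision_tree(Dj, remaining) if Dj else majority
    return tree

# Toy training set: (attributes, class label)
D = [
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "senior", "student": "no"}, "yes"),
    ({"age": "senior", "student": "yes"}, "yes"),
]
tree = generate_decision_tree(D, ["age", "student"])
print(tree)
```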
1. Let D be a training set of tuples and their associated class labels, and each tuple
is represented by an n-D attribute vector X = (x1, x2, …, xn) .
Given a tuple, X, the classifier will predict that X belongs to the class having the
highest posterior probability, conditioned on X.
The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if
P(X | Ci) = ∏(k=1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
- The probabilities P(𝑋1 |𝐶𝑖 ) ,… P(𝑋𝑛 |𝐶𝑖 ) can be estimated from the
training tuples.
(a) If Ak is categorical, P(xk | Ci) is the number of tuples in Ci having value
xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
(b) If Ak is continuous-valued, P(xk | Ci) is typically assumed to follow a
Gaussian distribution with mean μCi and standard deviation σCi,
so that P(xk | Ci) = g(xk, μCi, σCi).
5. To predict the class label of X, P(X | Ci)P(Ci) is evaluated for each class
Ci; the classifier predicts the class Ci for which this product is the maximum.
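The training and prediction steps above can be sketched for categorical attributes; the tiny training set and attribute names are invented:

```python
from collections import Counter

# A minimal naïve Bayesian classifier over categorical attributes, following
# the steps above. The training set and attribute names are invented.

def train(D):
    classes = Counter(label for _, label in D)
    priors = {c: n / len(D) for c, n in classes.items()}   # P(Ci)
    counts = {}   # (class, attribute, value) -> count
    for x, c in D:
        for a, v in x.items():
            counts[(c, a, v)] = counts.get((c, a, v), 0) + 1
    return priors, counts, classes

def predict(x, priors, counts, classes):
    scores = {}
    for c in priors:
        p = priors[c]
        for a, v in x.items():
            p *= counts.get((c, a, v), 0) / classes[c]     # P(xk | Ci)
        scores[c] = p                                      # P(X | Ci) P(Ci)
    return max(scores, key=scores.get), scores

D = [
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "youth", "income": "low"}, "yes"),
    ({"age": "senior", "income": "low"}, "yes"),
    ({"age": "senior", "income": "high"}, "yes"),
]
priors, counts, classes = train(D)
label, scores = predict({"age": "youth", "income": "low"}, priors, counts, classes)
print(label, scores)
```

(In practice a Laplacian correction is applied so that a single zero count does not wipe out the whole product; it is omitted here for brevity.)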
a) They did not ask my permission to do that, and this is considered spying on
my privacy and life.
Data mining is the process of finding patterns in a given data set. These patterns
can often provide meaningful and insightful data to whoever is interested in that
data. Data mining is used today in a wide variety of contexts – in fraud detection,
as an aid in marketing campaigns, and even supermarkets use it to study their
consumers.
If you’ve ever used a credit card, then you may know that credit card companies
will alert you when they think that your credit card is being fraudulently used by
someone other than you. This is a perfect example of data mining – credit card
companies have a history of your purchases from the past and know geographically
where those purchases have been made. If all of a sudden some purchases are made
in a city far from where you live, the credit card companies are put on alert to a
possible fraud since their data mining shows that you don’t normally make
purchases in that city. Then, the credit card company can disable your card for that
transaction or just put a flag on your card for suspicious activity.
Example of data warehousing
Remember that data warehousing is a process that must occur before any data
mining can take place. In other words, data warehousing is the process of
compiling and organizing data into one common database, and data mining is the
process of extracting meaningful data from that database. The data mining process
relies on the data compiled in the data warehousing phase in order to detect
meaningful patterns.
Data mining is the process of extracting meaningful patterns from large data sets.
Both data mining and data warehousing are business intelligence collection tools.
Data warehousing is a tool to save time and improve efficiency by bringing
together data from different locations and different areas of the organization.
A data warehouse has three layers, namely staging, integration, and access.
Q9) Explain the Association algorithm in data mining.
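No answer is given here; assuming the question refers to the Apriori algorithm for mining frequent itemsets (the basis of association rules), a minimal sketch with invented transactions:

```python
# Apriori sketch: repeatedly join frequent k-itemsets into (k+1)-candidates and
# prune those below the minimum support. Transactions and min_support are invented.

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)
    # L1: frequent individual items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while frequent:
        result.update({fs: support(fs) for fs in frequent})
        k += 1
        # join step: candidates of size k, then prune by support
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_support}
    return result

transactions = [
    {"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "diapers"}, {"bread"},
]
freq = apriori(transactions, min_support=0.5)
print(freq)
```

Association rules such as milk ⇒ bread are then derived from the frequent itemsets whose confidence exceeds a minimum threshold.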
Q10) Suppose that the data mining task is to cluster points (with (x, y)
representing location) into three clusters, where the points are
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1,
and C1 as the center of each cluster, respectively. Use the k-means algorithm to
show only (a) the three cluster centers after the first round of execution and
(b) the final three clusters.
(a) After the first round, the three new clusters are:
(1) {A1}, (2) {B1, A3, B2, B3, C2}, (3) {C1, A2},
and their centers are
(1) (2, 10), (2) (6, 6), (3) (1.5, 3.5).
(b) The final three clusters are:
(1) {A1, C2, B1}, (2) {A3, B2, B3}, (3) {C1, A2}.
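These results can be checked with a short k-means implementation, assuming the standard point set for this exercise (A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)) with A1, B1, C1 as the initial centers:

```python
# K-means sketch for Q10, assuming the standard point set for this exercise,
# with A1, B1, C1 as the initial cluster centers.

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}

def dist2(p, q):
    # squared Euclidean distance (same ordering as Euclidean distance)
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centers):
    while True:
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for name, p in points.items():
            i = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[i].append(name)
        # update step: recompute each center as the mean of its cluster
        new_centers = [
            (sum(points[n][0] for n in c) / len(c),
             sum(points[n][1] for n in c) / len(c))
            for c in clusters
        ]
        if new_centers == centers:      # converged: centers no longer move
            return clusters, centers
        centers = new_centers

clusters, centers = kmeans(points, [points["A1"], points["B1"], points["C1"]])
print(clusters)
print(centers)
```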
Answer:
The schema of one database may not agree with the schema of another. A database
system supports ad-hoc queries and on-line transaction processing.