G H RAISONI UNIVERSITY
Department of Computer Science and Engineering
(Established Under UGC (2f) and Madhya Pradesh Niji Vishwavidyalaya (Sthapana evam Sanchalan) Adhiniyam Act No. 17 of 2007), Gram Dhoda Borgaon, Village-Saikheda, Teh-Saunsar, Dist.-Chhindwara, (M.P.) – 480337
Tel: +91 9111104290/91, Web: www.ghru.edu.in, E-Mail: info@ghru.edu.in

School of Engineering & Technology

Teacher Assessment Examination (TAE)
DWM Notes
Course Name: Data Warehousing and Mining
Department: B.Tech CSE    Semester: I

Unit No: 01 to 05
Name of Faculty: Mr. Abhishek Kundu

Session 2023-24

Discuss Functionalities of Data Mining.

Ans: Data mining is a process that involves extracting useful information


and patterns from large datasets. It is a powerful tool that enables
organizations to gain valuable insights and make data-driven decisions.
The functionalities of data mining are vast and varied, making it an
indispensable tool for businesses across various industries.

One of the key functionalities of data mining is the ability to identify


patterns and relationships within datasets. By analyzing large volumes of
data, data mining algorithms can uncover hidden patterns that may not
be apparent to the human eye. This can include relationships between
variables, trends over time, or correlations between different data
points. These patterns can provide businesses with valuable insights that
can be used to improve decision-making, optimize processes, or identify
new opportunities.

Another important functionality of data mining is predictive modeling. By


analyzing historical data and identifying patterns, data mining algorithms
can make accurate predictions about future outcomes. This can be


particularly useful for businesses in forecasting sales, customer behavior,
or market trends. Predictive modeling can also help in identifying
potential risks or fraudulent activities, enabling businesses to take
proactive measures to mitigate them.

Data mining also plays a crucial role in customer segmentation and


targeting. By analyzing customer data, businesses can identify different
segments of customers based on their behavior, preferences, or
demographics. This information can then be used to tailor marketing
campaigns, develop personalized recommendations, or improve
customer service. By understanding the unique needs and preferences of
different customer segments, businesses can enhance customer
satisfaction and drive revenue growth.

Furthermore, data mining enables businesses to perform market basket


analysis. This involves analyzing customer purchase data to identify
associations and patterns between different products. By understanding
which products are often purchased together, businesses can optimize
their product placement, cross-selling, and upselling strategies. This can
lead to higher sales and increased customer satisfaction.

Data mining also provides valuable insights for risk management and
fraud detection. By analyzing historical data, businesses can identify
patterns or anomalies that may indicate potential risks or fraudulent
activities. This can include detecting unusual transactions, identifying
suspicious behaviors, or detecting patterns of fraud. By leveraging data
mining techniques, businesses can proactively detect and prevent
potential risks or fraudulent activities, saving both time and money.

In conclusion, the functionalities of data mining are extensive and


diverse. From identifying patterns and relationships within datasets to
predictive modeling, customer segmentation, market basket analysis,
and risk management, data mining enables businesses to extract
valuable insights and make informed decisions. With the exponential
growth of data, data mining has become an essential tool for businesses
to gain a competitive edge and drive growth in today's data-driven
world.
Classify Data Mining systems.
Ans: In the rapidly evolving world of technology, data is considered the new oil.
Organizations of all sizes and across various industries are increasingly relying
on data to gain valuable insights and make informed decisions. However, with
the ever-increasing volume of data being generated, it has become crucial to
have effective tools and techniques to extract meaningful information from this
vast sea of data. This is where data mining systems come into play.

Data mining systems are powerful tools that allow organizations to discover
patterns, correlations, and relationships within their data. These systems use a
combination of statistical algorithms, machine learning techniques, and artificial
intelligence to analyze large amounts of data and uncover hidden patterns and
trends. By doing so, organizations can gain valuable insights that can be used for
various purposes such as business intelligence, marketing strategies, risk
assessment, and fraud detection, among others.

There are several types of data mining systems, each with its own unique
characteristics and applications. Let's explore some of the most common types:

Association Rule Mining: This type of data mining system focuses on


discovering relationships or associations between different variables in a dataset.
It helps identify patterns such as "customers who bought product A also bought
product B." Association rule mining is widely used in market basket analysis,
where retailers analyze customer purchase patterns to optimize product
placement and cross-selling.
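
As a small illustration (not part of the original notes), the support and confidence of a rule such as "bread => butter" can be computed directly from a list of transactions; the baskets below are made up:

# Minimal sketch: support and confidence for the rule "bread -> butter"
# computed from a few made-up market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread_only = sum(1 for t in transactions if "bread" in t)

support = both / n               # fraction of all baskets containing bread AND butter
confidence = both / bread_only   # of the baskets with bread, how many also have butter
print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.50, confidence=0.67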

Classification and Prediction: Classification data mining systems are used to


categorize data into predefined classes or groups based on certain attributes or
variables. These systems use algorithms to build models that can predict future
outcomes or assign new data to specific categories. For example, they can be
used in credit scoring to predict the likelihood of a customer defaulting on a loan
based on their credit history.

Clustering: Clustering data mining systems group similar data points together
based on their characteristics or attributes. This helps identify similarities or
patterns within the data that might not be initially apparent. Clustering is
commonly used in customer segmentation, where organizations group customers
with similar characteristics to target them with personalized marketing
campaigns.

Regression: Regression data mining systems are used to analyze the relationship
between a dependent variable and one or more independent variables. They help
organizations understand how changes in one variable affect the others.
Regression analysis is frequently used in sales forecasting, where historical sales
data is analyzed to predict future sales based on various factors such as price,
promotion, and seasonality.
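
A minimal sketch (with made-up numbers) of fitting a least-squares line that relates sales to price, the kind of relationship regression analysis estimates:

# Minimal sketch: ordinary least-squares fit of sales against price in plain Python.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]      # made-up data
sales  = [200.0, 180.0, 165.0, 150.0, 130.0]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(sales) / n

# slope b and intercept a of the least-squares line  sales = a + b * price
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, sales)) / \
    sum((x - mean_x) ** 2 for x in prices)
a = mean_y - b * mean_x

print(f"forecast at price 20: {a + b * 20:.1f}")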

Anomaly Detection: Anomaly detection data mining systems focus on


identifying unusual or abnormal patterns within a dataset. They help
organizations detect outliers or anomalies that deviate significantly from the
expected behavior. Anomaly detection can be applied in various domains such
as fraud detection, network intrusion detection, and quality control.

It's worth noting that these types of data mining systems are not mutually
exclusive, and often multiple techniques are used in combination to extract the
most valuable insights from the data. Additionally, the choice of data mining
system depends on the specific requirements and objectives of an organization.

In conclusion, data mining systems play a crucial role in extracting meaningful


insights from large datasets. They help organizations uncover patterns,
relationships, and trends that can be used to drive informed decision-making. By
understanding the different types of data mining systems, organizations can
choose the most appropriate techniques to suit their specific needs and gain a
competitive edge in today's data-driven world.

Write about Data preprocessing.


Ans: Data preprocessing plays a crucial role in data mining as it helps transform
raw data into a clean and organized format that can be easily analyzed and used
for decision-making. It is an essential step in the data mining process as the
quality of the data directly impacts the accuracy and effectiveness of the
analysis.

In simple terms, data preprocessing refers to the techniques and methods used to
prepare data for analysis. It involves cleaning the data, handling missing values,
dealing with outliers, normalizing the data, and reducing dimensionality. By
performing these preprocessing steps, data mining algorithms can more
effectively extract meaningful patterns and insights from the data.
One of the primary tasks in data preprocessing is data cleaning. Real-world data
is often incomplete, noisy, and inconsistent. The cleaning process involves
removing or correcting inaccurate data, dealing with missing values, and
handling duplicate or inconsistent data. For example, if a dataset contains
missing values, analysts can choose to either remove those records or impute the
missing values using techniques such as mean imputation or regression
imputation.

Another important step is outlier detection and handling. Outliers are data points
that significantly deviate from the normal pattern and can have a significant
impact on the analysis. Identifying and dealing with outliers is essential to
ensure that they do not skew the results. Outliers can be identified using
statistical techniques such as Z-score or using domain knowledge.
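
For example, a simple z-score check in plain Python (made-up values; the cutoff of 3 is a common convention, not a rule from these notes):

# Minimal sketch: flagging values whose z-score exceeds a chosen threshold.
import statistics

values = [48, 52, 50, 49, 51, 47, 50, 53, 49, 51, 120]   # 120 is the obvious outlier
mean = statistics.mean(values)
stdev = statistics.pstdev(values)          # population standard deviation

threshold = 3.0
outliers = [v for v in values if abs((v - mean) / stdev) > threshold]
print(outliers)                            # [120]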

Data normalization is another crucial preprocessing step. It involves


transforming the data so that it falls within a specific range or follows a specific
distribution. Normalization is particularly important when dealing with data of
different units or scales. Common normalization techniques include min-max
normalization, z-score normalization, and decimal scaling.
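
A minimal sketch of min-max and z-score normalization on one numeric column, in plain Python with made-up values:

# Minimal sketch: min-max and z-score normalization of a single column.
import statistics

ages = [18, 25, 32, 47, 61]

lo, hi = min(ages), max(ages)
min_max = [(a - lo) / (hi - lo) for a in ages]            # rescaled into [0, 1]

mean, stdev = statistics.mean(ages), statistics.stdev(ages)
z_scores = [(a - mean) / stdev for a in ages]             # mean 0, unit variance

print([round(v, 2) for v in min_max])
print([round(z, 2) for z in z_scores])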

Dimensionality reduction is also an integral part of data preprocessing in data


mining. It aims to reduce the number of input variables while retaining the most
relevant information. High-dimensional data can be computationally expensive
and may lead to overfitting. Techniques such as Principal Component Analysis
(PCA) and feature selection help in reducing the dimensionality of the data.
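
A minimal sketch using scikit-learn's PCA (assuming scikit-learn and NumPy are installed; the data is random and purely illustrative):

# Minimal sketch: reducing 10-dimensional data to its top 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # 100 samples, 10 features (made-up data)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # project onto the top 2 components

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component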

Data preprocessing also includes other tasks such as data integration, where
multiple datasets are combined, and data transformation, where the data is
converted into a suitable format for analysis. These steps ensure that the data is
in a standardized and consistent format, making it easier for data mining
algorithms to extract meaningful patterns and insights.

In conclusion, data preprocessing is a critical step in the data mining process. It


involves cleaning, handling missing values, dealing with outliers, normalizing
the data, and reducing dimensionality. By performing these preprocessing steps,
analysts can ensure that the data is in a suitable format for analysis, leading to
more accurate and effective results. Without proper data preprocessing, the
analysis may be biased, inaccurate, or inefficient. Therefore, it is essential to
give proper attention to data preprocessing to enhance the quality and reliability
of data mining results.

What is data cleaning?


Ans: What is Data Cleaning in Data Mining?

Data mining is the process of extracting valuable insights and patterns from
large datasets. However, before this can be done, it is crucial to ensure that the
data being analyzed is of high quality and free from errors. This is where data
cleaning comes into play.

Data cleaning, also known as data cleansing or data scrubbing, is the process of
identifying and correcting or removing errors, inconsistencies, and inaccuracies
within a dataset. It involves detecting and resolving issues such as missing
values, duplicate records, outliers, and formatting errors. The goal is to
transform messy and raw data into clean, reliable, and consistent data that can be
effectively analyzed.

Why is Data Cleaning Important?

Data cleaning is an essential step in the data mining process. Here are a few
reasons why it is important:

Accurate Analysis: Clean data ensures accurate analysis and reliable results. If
the dataset is riddled with errors and inconsistencies, any insights or patterns
derived from it may be misleading or inaccurate. By cleaning the data, we can
ensure that the analysis is based on reliable information.

Improved Data Integration: In many cases, data mining involves combining


different datasets from various sources. Each dataset may have its own format,
structure, and quality issues. Data cleaning helps to standardize and align the
datasets, making integration and analysis much easier.

Enhanced Decision-Making: Data-driven decision-making is only effective


when the data being used is of high quality. Cleaning the data ensures that
decision-makers have access to accurate and reliable information, leading to
better-informed decisions and improved outcomes.
Common Data Cleaning Techniques:

Data cleaning can involve a variety of techniques and methods depending on the
specific issues present in the dataset. Here are some common techniques used in
data cleaning:

Removing Duplicates: Duplicate records can skew the analysis and produce
inaccurate results. Identifying and removing duplicate entries helps to eliminate
redundancy and ensures that each data point is unique.

Handling Missing Values: Missing values are a common issue in datasets and
can arise due to various reasons such as data entry errors or incomplete data
collection. Techniques such as imputation (replacing missing values with
estimated values) or deletion can be used to handle missing values appropriately.
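
For instance, assuming the pandas library (column names and values are made up), duplicate rows can be dropped and missing values imputed with the column mean:

# Minimal sketch: duplicate removal and mean imputation with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 31, 25, 40],
    "income": [30000, 42000, None, 30000, 52000],
})

df = df.drop_duplicates()                              # drop exact duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())         # mean imputation
df["income"] = df["income"].fillna(df["income"].mean())
print(df)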

Outlier Detection: Outliers are data points that deviate significantly from the
majority of the data. They can have a significant impact on the analysis, leading
to biased results. Identifying and handling outliers helps to ensure that the
analysis is not influenced by extreme values.

Standardization and Formatting: Datasets may have inconsistencies in terms of


formatting, units of measurement, or naming conventions. Standardizing the data
helps to ensure uniformity and compatibility across the dataset, making it easier
to analyze and interpret.

Conclusion:

Data cleaning is a critical step in the data mining process. It helps to ensure that
the data being analyzed is accurate, reliable, and consistent. By detecting and
resolving errors, inconsistencies, and inaccuracies, data cleaning allows analysts
to derive meaningful insights and patterns from the dataset. Effective data
cleaning techniques such as removing duplicates, handling missing values,
outlier detection, and standardization help to enhance the quality of the data and
improve the accuracy of analysis. Ultimately, data cleaning plays a vital role in
enabling data-driven decision-making and extracting valuable information from
large datasets.
Write about Data Cube technology.
Ans: Data mining is a powerful technique that involves extracting valuable insights and patterns from large datasets. With the ever-increasing amount of data being generated, traditional data mining methods are struggling to keep up with the complexity and scale of these datasets. This is where data cube technology comes into play.

Data cube technology, also known as OLAP (Online Analytical Processing)
cube, is a multidimensional representation of data that allows for efficient and
effective analysis. It enables users to view data from various perspectives,
known as dimensions, and perform complex analysis by aggregating data along
those dimensions.
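
As a rough illustration (not part of the original notes; it assumes the pandas library), the kind of aggregation a data cube supports can be sketched by summarizing a small sales table along two dimensions, region and quarter:

# Minimal sketch: aggregating a fact table along two dimensions with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 90, 130],
})

# One "slice" of a cube: total amount by region and quarter.
print(sales.pivot_table(values="amount", index="region",
                        columns="quarter", aggfunc="sum"))

# Roll-up along the quarter dimension: totals per region only.
print(sales.groupby("region")["amount"].sum())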

A data cube consists of dimensions, measures, and hierarchies. Dimensions represent the different attributes of the data, such as time, geography, or product. Measures, on the other hand, are the numeric values that are being analyzed, such as sales, profits, or customer counts. Hierarchies represent the different levels of granularity within each dimension, allowing for drill-down and roll-up operations.

Data cube technology offers several advantages over traditional data mining methods. Firstly, it allows for faster query performance, as data cubes are pre-computed and stored in a specialized data structure optimized for fast access. This eliminates the need to perform complex calculations on the fly, resulting in significant time savings.

Secondly, data cube technology provides a comprehensive view of data by enabling users to analyze it from multiple dimensions simultaneously. This multidimensional analysis allows for a deeper understanding of the data and uncovers hidden patterns and relationships that may not be apparent using traditional data mining techniques.

Furthermore, data cubes support the exploration of data at different levels of granularity. Users can drill down into the data to view detailed information or roll up to higher levels of aggregation to gain a broader perspective. This flexibility allows for in-depth analysis at various levels, providing valuable insights for decision-making.

Data cube technology is widely used in various industries, including finance, retail, healthcare, and telecommunications. For example, in retail, data cubes can be used to analyze sales performance by product, region, and time, identifying trends, popular products, and areas for improvement. In finance, data cubes can help analyze investment portfolios based on asset class, risk level, and return, enabling better portfolio management decisions.

However, data cube technology also has its limitations. The biggest challenge lies in the storage requirements, as data cubes can become extremely large, especially when dealing with massive datasets. Managing and storing these cubes can be a resource-intensive task, requiring significant computing power and storage capabilities.

In conclusion, data cube technology is a powerful tool in the field of data mining.

Write in brief data warehouse.

Ans: A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.

A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical capabilities allow organizations to derive valuable business insights from their data to improve decision-making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts. Because of these capabilities, a data warehouse can be considered an organization's "single source of truth."

A typical data warehouse often includes the following elements:
A relational database to store and manage data
An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
Statistical analysis, reporting, and data mining capabilities
Client analysis tools for visualizing and presenting data to business users
Other, more sophisticated analytical applications that generate actionable information by applying data science and artificial intelligence (AI) algorithms, or graph and spatial features that enable more kinds of analysis of data at scale

Organizations can also select a solution combining transaction processing, real-time analytics across data warehouses and data lakes, and machine learning in one MySQL Database service—without the complexity, latency, cost, and risk of extract, transform, and load (ETL) duplication.

Benefits of a Data Warehouse
Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts of variant data and extract significant value from it, as well as to keep a historical record.

Four unique characteristics (described by computer scientist William Inmon,


who is considered the father of the data warehouse) allow data warehouses to
deliver this overarching benefit. According to this definition, data warehouses
are

Subject-oriented. They can analyze data about a particular subject or functional


area (such as sales).
Integrated. Data warehouses create consistency among different data types from
disparate sources.
Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
Time-variant. Data warehouse analysis looks at change over time.
A well-designed data warehouse will perform queries very quickly, deliver high
data throughput, and provide enough flexibility for end users to “slice and dice”
or reduce the volume of data for closer examination to meet a variety of
demands—whether at a high level or at a very fine, detailed level. The data
warehouse serves as the functional foundation for middleware BI environments
that provide end users with reports, dashboards, and other interfaces.

Data Warehouse Architecture


The architecture of a data warehouse is determined by the organization’s specific
needs. Common architectures include

Simple. All data warehouses share a basic design in which metadata, summary
data, and raw data are stored within the central repository of the warehouse. The
repository is fed by data sources on one end and accessed by end users for
analysis, reporting, and mining on the other end.
Simple with a staging area. Operational data must be cleaned and processed
before being put in the warehouse. Although this can be done programmatically,
many data warehouses add a staging area for data before it enters the warehouse,
to simplify data preparation.
Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data
mart.
Sandboxes. Sandboxes are private, secure, safe areas that allow companies to
quickly and informally explore new datasets or ways of analyzing data without
having to conform to or comply with the formal rules and protocol of the data
warehouse.
The Evolution of Data Warehouses—From Data Analytics to AI and Machine
Learning
When data warehouses first came onto the scene in the late 1980s, their purpose
was to help data flow from operational systems into decision-support systems
(DSSs). These early data warehouses required an enormous amount of
redundancy. Most organizations had multiple DSS environments that served
their various users. Although the DSS environments used much of the same data,
the gathering, cleaning, and integration of the data was often replicated for each
environment.

As data warehouses became more efficient, they evolved from information


stores that supported traditional BI platforms into broad analytics infrastructures
that support a wide variety of applications, such as operational analytics and
performance management.

Explain Classification by Decision Tree Induction in Data Mining.


Ans: A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label. The topmost node in
the tree is the root node.

The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a class.

[Figure: decision tree for the concept buy_computer]
The benefits of having a decision tree are as follows −

It does not require any domain knowledge.


It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan in 1980 developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5,
which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In this
algorithm, there is no backtracking; the trees are constructed in a top-down
recursive divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D


Algorithm : Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or splitting subset.

Output:
A Decision Tree

Method
create a node N;

if tuples in D are all of the same class, C then


return N as leaf node labeled with class C;

if attribute_list is empty then


return N as leaf node labeled with the majority class in D; // majority voting

apply attribute_selection_method(D, attribute_list)


to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees

attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute


for each outcome j of splitting criterion

// partition the tuples and grow subtrees for each partition


let Dj be the set of data tuples in D satisfying outcome j; // a partition

if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
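
The Attribute_selection_method above is usually based on a measure such as information gain (used by ID3). A minimal, illustrative Python sketch (not part of the original notes) of computing the information gain of one attribute on a made-up buy_computer dataset:

# Minimal sketch: information gain of attribute "age" for the class buy_computer.
import math
from collections import Counter

data = [                     # (age, buy_computer) tuples, made up for illustration
    ("youth", "no"), ("youth", "no"), ("middle", "yes"),
    ("senior", "yes"), ("senior", "no"), ("middle", "yes"),
    ("youth", "yes"), ("senior", "yes"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

base = entropy([cls for _, cls in data])          # Info(D)

expected = 0.0                                    # expected information after splitting on "age"
for value in {age for age, _ in data}:
    subset = [cls for age, cls in data if age == value]
    expected += len(subset) / len(data) * entropy(subset)

print("information gain =", round(base - expected, 3))
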
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches


There are two approaches to prune a tree −

Pre-pruning − The tree is pruned by halting its construction early.

Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −
Number of leaves in the tree, and


Error rate of the tree.

Discuss Rule Based Classification


Ans: IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We
can express a rule in the following form −

IF condition THEN conclusion


Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes
Points to remember −

The IF part of the rule is called rule antecedent or precondition.

The THEN part of the rule is called rule consequent.

The antecedent part (the condition) consists of one or more attribute tests, and these
tests are logically ANDed.

The consequent part consists of class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ^ (student = yes) => (buy_computer = yes)


If the condition holds true for a given tuple, then the antecedent is satisfied.
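
As a tiny illustration (not part of the original notes), rule R1 can be applied to a tuple represented as a Python dictionary:

# Minimal sketch: applying rule R1 to a tuple.
def r1(t):
    # IF age = youth AND student = yes THEN buy_computer = yes
    if t["age"] == "youth" and t["student"] == "yes":
        return "yes"
    return None   # rule does not fire; another rule or a default class must decide

print(r1({"age": "youth", "student": "yes"}))    # yes
print(r1({"age": "senior", "student": "yes"}))   # None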

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN
rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −

One rule is created for each path from the root to the leaf node.

To form a rule antecedent, each splitting criterion is logically ANDed.

The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


The sequential covering algorithm can be used to extract IF-THEN rules from the
training data. We do not need to generate a decision tree first. In this
algorithm, each rule for a given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per
the general strategy, the rules are learned one at a time. Each time a rule is
learned, the tuples covered by the rule are removed and the process continues for the
remaining tuples. (In a decision tree, by contrast, the path to each leaf
corresponds to a rule.)

Note − The Decision tree induction can be considered as learning a set of rules
simultaneously.

The following is the sequential learning algorithm, where rules are learned for
one class at a time. When learning a rule for a class Ci, we want the rule to
cover all the tuples of class Ci only and no tuple of any other class.

Algorithm: Sequential Covering

Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.

Output: A Set of IF-THEN rules.


Method:
Rule_set={ }; // initial set of rules learned is empty

for each class c do

repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
Rule_set = Rule_set + Rule; // add the new rule to the rule set
until termination condition;


end for
return Rule_Set;
Rule Pruning
Rule pruning is done for the following reason −

The assessment of quality is made on the original set of training data. The rule
may perform well on the training data but less well on subsequent data. That is why
rule pruning is required.

A rule is pruned by removing a conjunct. Rule R is pruned if the pruned version
of R has greater quality than R, as assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule
R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)


where pos and neg are the number of positive and negative tuples covered by R, respectively.

Note − This value will increase with the accuracy of R on the pruning set.
Hence, if the FOIL_Prune value is higher for the pruned version of R, then we
prune R.
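
As a small sketch, FOIL_Prune can be computed for the original and the pruned version of a rule and compared; the positive and negative counts below are invented:

# Minimal sketch: pruning decision based on FOIL_Prune = (pos - neg) / (pos + neg).
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

original = foil_prune(pos=30, neg=10)   # rule R on the pruning set
pruned   = foil_prune(pos=28, neg=4)    # R with one conjunct removed (made-up counts)

if pruned > original:
    print("keep the pruned version of R")   # higher FOIL_Prune, so prune
else:
    print("keep the original rule R")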

Categorize Major Clustering Methods.

Ans: There exist a large number of clustering algorithms in the literature. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose. In general, major clustering methods can be classified into the following categories.

Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various other kinds of criteria for judging the quality of partitions. To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth later.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster and successively splits clusters into smaller ones, until eventually each object is in its own cluster, or until a termination condition holds. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, a major problem of such techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, as in CURE and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.

Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN is a typical density-based method that grows clusters according to a density threshold. OPTICS is a density-based method that computes an augmented cluster ordering for automatic and interactive cluster analysis.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space. STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two clustering algorithms that are both grid-based and density-based.

Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. This also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account, and thus yields robust clustering methods. Model-based clustering methods are studied below.

Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. In the following sections, we examine each of the above five clustering methods in detail. We also introduce algorithms that integrate the ideas of several clustering methods. Outlier analysis, which typically involves clustering, is described at the end of this section.

Partitioning Methods: Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the database attributes.

Classical Partitioning Methods: k-means and k-medoids. The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.

Centroid-Based Technique: The k-Means Method. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's center of gravity.

"How does the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as

E = Σ (i = 1..k) Σ (p ∈ Ci) |p - mi|^2

where E is the sum of squared error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional).
This criterion tries to make the resulting k clusters as compact and as separate as possible. The algorithm attempts to determine k partitions that minimize the squared-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.
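
A compact, illustrative implementation of the k-means loop described above, in plain Python with made-up 2-D points (a real application would normally use an optimized library implementation):

# Minimal sketch: the k-means assignment and mean-update steps.
import random

def kmeans(points, k, iterations=10, seed=0):
    random.seed(seed)
    means = random.sample(points, k)                 # k initial cluster centers
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2 + (p[1] - means[j][1]) ** 2)
            clusters[i].append(p)
        # update step: recompute each cluster mean
        for i, c in enumerate(clusters):
            if c:
                means[i] = (sum(x for x, _ in c) / len(c),
                            sum(y for _, y in c) / len(c))
    return means, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
means, clusters = kmeans(pts, k=2)
print(means)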

What is Spatial Data Mining?

Ans: Spatial data mining is a specialized subfield of data mining that deals with
extracting knowledge from spatial data. Spatial data refers to data that is
associated with a particular location or geography. Examples of spatial data
include maps, satellite images, GPS data, and other geospatial information.
Spatial data mining involves analyzing and discovering patterns, relationships,
and trends in this data to gain insights and make informed decisions.

The use of spatial data mining has become increasingly important in various
fields, such as logistics, environmental science, urban planning, transportation,
and public health. By analyzing spatial data, researchers and data mining
professionals can identify correlations, predict future events, and make
informed decisions that can have a significant impact. For instance, a
transportation company can optimize its delivery routes for faster and more
efficient deliveries using spatial data mining techniques. They can analyze their
delivery data along with other spatial data, such as traffic flow, road network,
and weather patterns, to identify the most efficient routes for each delivery.

In the following sections, we'll answer questions about spatial data mining.

Types of Spatial Data

Different types of spatial data are used in spatial data mining. These include
point data, line data, and polygon data.

Point Data

Point data represents a single location or a set of locations on a map. Each


point is defined by its x and y coordinates, representing its position in the
geographic space. Point data is commonly used to represent geographic
features such as cities, landmarks, or specific locations of interest. Examples of
point data in transportation include delivery locations, bus stops, or railway
stations.

Line Data

Line data represents a linear feature, such as a road, a river, or a pipeline, on a


map. Each line is defined by a set of vertices, which represent the start and end
points of the line. Line data is commonly used to represent transportation
networks, such as roads, highways, or railways. Line data is also used in other
areas, such as hydrology, geology, or ecology, to represent streams, faults, or
animal migration routes.

Polygon Data

Polygon data represents a closed shape or an area on a map. Each polygon is


defined by a set of vertices that connect to form a closed boundary. Polygon
data is commonly used to represent administrative boundaries, land use, or
demographic data. In transportation, polygon data can be used to represent
areas of interest, such as delivery zones or traffic zones.

In summary, point data represents a single location, line data represents a


linear feature, and polygon data represents an area or a closed shape.
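
As a small illustration of working with point data (not from the original notes; the coordinates are made up), the great-circle distance between two delivery locations can be computed with the haversine formula:

# Minimal sketch: haversine distance between two (latitude, longitude) points.
import math

def haversine_km(p1, p2):
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))   # mean Earth radius of about 6371 km

depot    = (21.15, 79.09)   # made-up delivery depot
customer = (22.07, 79.49)   # made-up customer location
print(round(haversine_km(depot, customer), 1), "km")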

Write about Multimedia Data Mining.

Ans: Multimedia mining is a subfield of data mining that is used to find
interesting information or implicit knowledge from multimedia databases.
Mining in multimedia is referred to as automatic annotation or annotation
mining. Mining multimedia data requires two or more data types, such as text
and video, or text, video, and audio.

Multimedia data mining is an interdisciplinary field that integrates image


processing and understanding, computer vision, data mining, and pattern
recognition. Multimedia data mining discovers interesting patterns from
multimedia databases that store and manage large collections of multimedia
objects, including image data, video data, audio data, sequence data and
hypertext data containing text, text markups, and linkages. Issues in multimedia
data mining include content-based retrieval and similarity search,
generalization and multidimensional analysis. Multimedia data cubes contain
additional dimensions and measures for multimedia information.

The framework that manages different types of multimedia data stored,


delivered, and utilized in different ways is known as a multimedia database
management system. There are three classes of multimedia databases: static,
dynamic, and dimensional media. The content of the Multimedia Database
management system is as follows:
Media data: The actual data representing an object.
Media format data: Information such as sampling rate, resolution, encoding
scheme, etc., about the format of the media data after it goes through the
acquisition, processing, and encoding phases.
Media keyword data: Keyword descriptions relating to the generation of the data. It
is also known as content-descriptive data. Example: date, time, and place of
recording.
Media feature data: Content-dependent data such as the distribution of colours,
kinds of texture, and different shapes present in the data.
Types of Multimedia Applications
Types of multimedia applications based on data management characteristics are:
Repository applications: A large amount of multimedia data and meta-data
(Media format date, Media keyword data, Media feature data) that is stored for
retrieval purposes, e.g., Repository of satellite images, engineering drawings,
radiology scanned pictures.
Presentation applications: They involve delivering multimedia data subject to
temporal constraints. Optimal viewing or listening requires DBMS to deliver
data at a certain rate, offering the quality of service above a certain threshold.
Here data is processed as it is delivered. Example: Annotating of video and
audio data, real-time editing analysis.
Collaborative work using multimedia information: This involves executing a complex
task by merging drawings and changing notifications. Example: Intelligent
healthcare network.

Explain Text Mining.

Ans: Text mining (also known as text analysis), is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language processing (NLP), allowing machines
to understand the human language and process it automatically.

For businesses, the large amount of data generated every day represents both an opportunity and a
challenge. On the one side, data helps companies get smart insights on people’s opinions about a
product or service. Think about all the potential ideas that you could get from analyzing emails, product
reviews, social media posts, customer feedback, support tickets, etc. On the other side, there’s the
dilemma of how to process all this data. And that’s where text mining plays a major role.
Like most things related to Natural Language Processing (NLP), text mining may sound like a hard-to-
grasp concept. But the truth is, it doesn’t need to be. This guide will go through the basics of text mining,
explain its different methods and techniques, and make it simple to understand how it works. You will
also learn about the main applications of text mining and how companies can use it to automate many
of their processes:

Getting started with text mining

How does text mining work?

Use cases and applications

Let’s jump right into it!

Getting Started With Text Mining

Text mining is an automatic process that uses natural language processing to extract valuable insights
from unstructured text. By transforming data into information that machines can understand, text
mining automates the process of classifying texts by sentiment, topic, and intent.

Thanks to text mining, businesses are able to analyze complex and large sets of data in a simple,
fast and effective way. At the same time, companies are taking advantage of this powerful tool to
reduce some of their manual and repetitive tasks, saving their teams precious time and allowing
customer support agents to focus on what they do best.

Let’s say you need to examine tons of reviews in G2 Crowd to understand what customers are praising
or criticizing about your SaaS. A text mining algorithm could help you identify the most popular topics
that arise in customer comments, and the way that people feel about them: are the comments positive,
negative or neutral? You could also find out the main keywords mentioned by customers regarding a
given topic.

In a nutshell, text mining helps companies make the most of their data, which leads to better data-
driven business decisions.

At this point you may already be wondering, how does text mining accomplish all of this? The answer
takes us directly to the concept of machine learning.
Machine learning is a discipline derived from AI, which focuses on creating algorithms that enable
computers to learn tasks based on examples. Machine learning models need to be trained with data,
after which they’re able to predict with a certain level of accuracy automatically.

When text mining and machine learning are combined, automated text analysis becomes possible.

Going back to our previous example of SaaS reviews, let’s say you want to classify those reviews into
different topics like UI/UX, Bugs, Pricing or Customer Support. The first thing you’d do is train a topic
classifier model, by uploading a set of examples and tagging them manually. After being fed several
examples, the model will learn to differentiate topics and start making associations as well as its own
predictions. To obtain good levels of accuracy, you should feed your models a large number of examples
that are representative of the problem you’re trying to solve.
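
A minimal sketch of such a topic classifier, assuming the scikit-learn library; the tagged snippets are invented and far too few for real accuracy:

# Minimal sketch: training a tiny topic classifier on manually tagged review snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["the dashboard layout is confusing", "crashes every time I export",
           "pricing is too high for small teams", "support replied within minutes"]
topics  = ["UI/UX", "Bugs", "Pricing", "Customer Support"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, topics)

print(model.predict(["the app crashes when I export a report"]))   # likely ['Bugs'] on this toy data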

Now that you’ve learned what text mining is, we’ll see how it differentiates from other usual terms, like
text analysis and text analytics.

Write about Mining the World Wide Web

Ans: Over the last few years, the World Wide Web has become a significant source of information and
simultaneously a popular platform for business. Web mining can be defined as the method of utilizing data
mining techniques and algorithms to extract useful information directly from the web, such as Web
documents and services, hyperlinks, Web content, and server logs. The World Wide Web contains a
large amount of data that provides a rich source to data mining. The objective of Web mining is to look
for patterns in Web data by collecting and examining data in order to gain insights.
What is Web Mining?

Web mining can widely be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of the algorithm to discover patterns on mostly
structured data embedded into a knowledge discovery process. Web mining has a distinctive property to
provide a set of various data types. The web has multiple aspects that yield different approaches for the
mining process, such as web pages consist of text, web pages are linked via hyperlinks, and user activity
can be monitored via web server logs. These three features lead to the differentiation between
three areas: web content mining, web structure mining, and web usage mining.

There are three types of web mining:


1. Web Content Mining:

Web content mining can be used to extract useful data, information, knowledge from the web page
content. In web content mining, each web page is considered as an individual document. The individual
can take advantage of the semi-structured nature of web pages, as HTML provides information that
concerns not only the layout but also logical structure. The primary task of content mining is data
extraction, where structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web content mining can be
utilized to distinguish topics on the web. For example, if any user searches for a specific task on the
search engine, then the user will get a list of suggestions.

2. Web Structured Mining:

Web structure mining can be used to find the link structure of hyperlinks. It is used to identify whether
data links web pages directly or through a wider link network. In Web Structure Mining, an individual considers the
web as a directed graph, with the web pages being the vertices that are associated with hyperlinks. The
most important application in this regard is the Google search engine, which estimates the ranking of its
outcomes primarily with the PageRank algorithm. It characterizes a page to be exceptionally relevant
when frequently connected by other highly related pages. Structure and content mining methodologies are
usually combined. For example, web structure mining can help organizations examine the link
network between two commercial sites.
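
A toy sketch of the PageRank idea, power iteration over a made-up four-page link graph (the 0.85 damping factor is the commonly quoted default, not something taken from these notes):

# Minimal sketch: PageRank by power iteration on a tiny hand-made link graph.
links = {                 # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping, n = 0.85, len(links)
rank = {p: 1 / n for p in links}

for _ in range(50):
    new_rank = {p: (1 - damping) / n for p in links}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)       # spread this page's rank over its outlinks
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

print({p: round(r, 3) for p, r in sorted(rank.items())})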

3. Web Usage Mining:

Web usage mining is used to extract useful data, information, and knowledge from web log records, and
assists in recognizing user access patterns for web pages. In mining the usage of web resources, one
considers the records of requests made by visitors to a website, which are often collected in web
server logs. While the content and structure of the collection of web pages reflect the intentions of the
authors of the pages, the individual requests demonstrate how the consumers see these pages. Web usage
mining may disclose relationships that were not proposed by the creator of the pages.
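
A small sketch of the kind of counting web usage mining starts from, using simplified, made-up server log lines:

# Minimal sketch: counting page requests from simplified web server log lines.
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Jan/2024] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024] "GET /products.html HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2024] "GET /products.html HTTP/1.1" 200',
]

pages = Counter(line.split('"')[1].split()[1] for line in log_lines)
print(pages.most_common())    # [('/products.html', 2), ('/index.html', 1)]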
