
Unit-1

Introduction to Data Mining


What Is Data Mining?
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer: the process would have been more appropriately named knowledge
mining, since the emphasis is on mining knowledge from large amounts of data. It is the
computational process of discovering patterns in large data sets using methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use. The key properties of data
mining are:
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of
store scanner data — and mining a mountain for a vein of valuable ore. Both processes
require either sifting through an immense amount of material, or intelligently probing it to
find exactly where the value resides. Given databases of sufficient size and quality, data
mining technology can generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviours. Data mining automates the process of
finding predictive information in large databases. Questions that traditionally required
extensive hands-on analysis can now be answered directly from the data — quickly. A
typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on
investment in future mailings. Other predictive problems include forecasting bankruptcy
and other forms of default, and identifying segments of a population likely to respond
similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern
discovery is the analysis of retail sales data to identify seemingly unrelated products that are
often purchased together. Other pattern discovery problems include detecting fraudulent
credit card transactions and identifying anomalous data that could represent data entry
keying errors.
Tasks of Data Mining
Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual
data records that might be interesting, or of data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis (a small sketch follows this list of tasks).
Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including
visualization and report generation.
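As a small illustration of the association rule (market basket) task above, here is a minimal sketch using only the Python standard library; the transactions and the 40% support threshold are made-up illustrative values, not data from these notes:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
    {"bread", "milk", "butter", "beer"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs that appear in at least 40% of the baskets (a minimum support threshold).
min_support = 0.4 * len(transactions)
for pair, count in pair_counts.most_common():
    if count >= min_support:
        print(pair, count)
```

A full market basket analysis would also compute the confidence of the resulting rules; the sketch only finds frequently co-occurring pairs.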
Architecture of Data Mining
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different levels of abstraction. Knowledge
such as user beliefs, which can be used to assess a pattern’s interestingness based on its
unexpectedness, may also be included. Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g., describing data from multiple
heterogeneous sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into the
mining process, so as to confine the search to only the interesting patterns.
4. User interface:
This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate
data mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
Data Mining Process:
Data Mining is a process of discovering various models, summaries, and derived values
from a given collection of data.
The general experimental procedure adapted to data-mining problems involves the
following steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence,
domain-specific knowledge and experience are usually necessary in order to come up with
a meaningful problem statement. Unfortunately, many application studies tend to focus on
the data-mining technique at the expense of a clear problem statement. In this step, a
modeler usually specifies a set of variables for the unknown dependency and, if possible, a
general form of this dependency as an initial hypothesis. There may be several hypotheses
formulated for a single problem at this stage. The first step requires combined expertise
in the application domain and in data-mining modeling. In practice, it usually means a close
interaction between the data-mining expert and the application expert. In successful data-
mining applications, this cooperation does not stop in the initial phase; it continues during
the entire data-mining process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the control
of an expert (modeler): this approach is known as a designed experiment. The second
possibility is when the expert cannot influence the data-generation process: this is known
as the observational approach. An observational setting, namely, random data generation, is
assumed in most data-mining applications. Typically, the sampling distribution is
completely unknown after data are collected, or it is partially and implicitly given in the
data-collection procedure. It is very important, however, to understand how data collection
affects its theoretical distribution, since such a priori knowledge can be very useful for
modeling and, later, for the final interpretation of results. Also, it is important to make sure
that the data used for estimating a model and the data used later for testing and applying a
model come from the same, unknown, sampling distribution. If this is not the case, the
estimated model cannot be successfully used in a final application of the results.
3. Preprocessing the data
In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement errors,
coding and recording errors, and, sometimes, they are natural, abnormal values. Such
non-representative samples can seriously affect the model produced later. There are two
strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.
2. Scaling, encoding, and selecting features – Data preprocessing includes several steps,
such as variable scaling and different types of encoding. For example, one feature with the
range [0, 1] and another with the range [−100, 1000] will not have the same weight in the
applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight for
further analysis (a small scaling sketch follows this list). Application-specific encoding
methods can also achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modeling.
These two classes of preprocessing tasks are only illustrative examples of a large spectrum
of preprocessing activities in a data-mining process.
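A minimal sketch of the scaling step described above, assuming NumPy is available; the two feature ranges ([0, 1] and [−100, 1000]) follow the example in the text, while the sample values themselves are made up for illustration:

```python
import numpy as np

# Two features on very different scales, as in the example above.
feature_a = np.array([0.1, 0.5, 0.9, 0.3])            # roughly in [0, 1]
feature_b = np.array([-100.0, 250.0, 1000.0, 40.0])   # roughly in [-100, 1000]

def min_max_scale(x):
    """Rescale a feature linearly onto the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# After scaling, both features carry comparable weight in distance-based techniques.
print(min_max_scale(feature_a))
print(min_max_scale(feature_b))
```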
Data-preprocessing steps should not be considered completely independent from other data-
mining phases. In every iteration of the data-mining process, all activities, together, could
define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and encoding.
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task
in this phase. This process is not straightforward; usually, in practice, the implementation is
based on several models, and selecting the best one is an additional task. The basic
principles of learning and discovery from data are given in Chapter 4 of the source text;
later chapters explain and analyze specific techniques that are applied to perform a
successful learning process from data and to develop an appropriate model.
5. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model and
accuracy of its interpretation are somewhat contradictory. Usually, simple models are more
interpretable, but they are also less accurate. Modern data-mining methods are expected to
yield highly accurate results using high dimensional models. The problem of interpreting
these models, also very important, is considered a separate task, with specific techniques to
validate the results. A user does not want hundreds of pages of numeric results; such output
cannot be understood, summarized, interpreted, or used for successful decision making.

The Data Mining Process

The Process of Knowledge Discovery


What are Data Mining and Knowledge Discovery? With the enormous amount of data
stored in files, databases, and other repositories, it is increasingly important, if not
necessary, to develop powerful means for analysis and perhaps interpretation of such data
and for the extraction of interesting knowledge that could help in decision-making. Data
Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information
from data in databases. While data mining and knowledge discovery in databases (or KDD)
are frequently treated as synonyms, data mining is actually part of the knowledge discovery
process. The following figure shows data mining as a step in an iterative knowledge
discovery process.

The Knowledge Discovery in Databases process comprises a few steps leading from raw
data collections to some form of new knowledge. The iterative process consists of the
following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noise data and
irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever techniques are applied to extract
potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are
identified based on given measures.
• Knowledge representation: the final phase, in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to help
users understand and interpret the data mining results.
It is common to combine some of these steps. For instance, data cleaning and data
integration can be performed together as a pre-processing phase to generate a data
warehouse. Data selection and data transformation can also be combined, where the
consolidation of the data is the result of the selection or, as in the case of data warehouses,
the selection is done on transformed data.
KDD is an iterative process. Once the discovered knowledge is presented to the user,
the evaluation measures can be enhanced, the mining can be further refined, new data can
be selected or further transformed, or new data sources can be integrated, in order to get
different, more appropriate results. Data mining derives its name from the similarities
between searching for valuable information in a large database and mining rocks for a vein
of valuable ore. Both imply either sifting through a large amount of material or ingeniously
probing the material to exactly pinpoint where the values reside. It is, however, a misnomer,
since mining for gold in rocks is usually called “gold mining” and not “rock mining”, thus
by analogy, data mining should have been called “knowledge mining” instead.
Nevertheless, data mining became the accepted customary term and very rapidly became a
trend that even overshadowed more general terms such as knowledge discovery in databases
(KDD), which describe a more complete process. Other similar terms referring to data
mining are data dredging, knowledge extraction, and pattern discovery.
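As a minimal sketch of the cleaning, selection, and transformation steps listed above, assuming pandas is installed; the column names and the small in-memory table are hypothetical stand-ins for a real data source:

```python
import pandas as pd

# Hypothetical raw collection with noise (a missing value and a negative amount).
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "amount": [120.0, None, 80.0, -5.0],
    "region": ["north", "south", "north", "east"],
})

# Data cleaning: remove records with missing or implausible values.
clean = raw.dropna(subset=["amount"])
clean = clean[clean["amount"] > 0]

# Data selection: keep only the attributes relevant to the analysis.
selected = clean[["customer_id", "amount"]]

# Data transformation: rescale the amount onto [0, 1] for the mining step.
selected = selected.assign(
    amount_scaled=(selected["amount"] - selected["amount"].min())
    / (selected["amount"].max() - selected["amount"].min())
)

print(selected)
```

The subsequent data mining, pattern evaluation, and knowledge representation steps would then operate on this prepared table.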
Predictive and Descriptive Techniques
Predictive Data Mining:
The main goal of this kind of mining is to say something about future results, not about
current behaviour. It uses supervised learning functions to predict a target value. The
methods that come under this category are classification, time-series analysis, and
regression. Modelling of the data is a necessity for predictive analysis, which works by
using a few variables of the present to predict unknown future values of other variables.
Descriptive Data Mining:
This term is basically used to produce correlations, cross-tabulations, frequencies, etc.
These techniques are used to determine similarities in the data and to find existing patterns.
A further application of descriptive analysis is to discover interesting subgroups in the
major part of the available data.
This kind of analysis emphasizes the summarization and transformation of the data into
meaningful information for reporting and monitoring.

Difference between Descriptive and Predictive Data Mining:

1. Basic: Descriptive data mining determines what happened in the past by analyzing
stored data; predictive data mining determines what can happen in the future with the help
of past data analysis.
2. Preciseness: Descriptive mining provides accurate data; predictive mining produces
results that do not ensure accuracy.
3. Practical analysis methods: Descriptive mining uses standard reporting, query/drill-down,
and ad-hoc reporting; predictive mining uses predictive modelling, forecasting, simulation,
and alerts.
4. Requirements: Descriptive mining requires data aggregation and data mining; predictive
mining requires statistics and forecasting methods.
5. Type of approach: Descriptive mining is a reactive approach; predictive mining is a
proactive approach.
6. What it describes: Descriptive mining describes the characteristics of the data in a target
data set; predictive mining carries out induction over current and past data so that
predictions can be made.
7. Typical questions: Descriptive mining asks "What happened?", "Where exactly is the
problem?", and "What is the frequency of the problem?"; predictive mining asks "What will
happen next?", "What is the outcome if these trends continue?", and "What actions are
required to be taken?"

Predictive and Descriptive Data mining Techniques


Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks.
Data mining tasks can be classified into two categories:
Predictive: These tasks predict the value of one attribute on the basis of the values of other
attributes. E.g.: customer/product prediction at a sales store.
Descriptive: These tasks present the general properties of data stored in a database.
Descriptive tasks are used to find patterns in data.
E.g.: clusters, correlations, trends, etc.
Data Mining Functionalities Prediction
Classification and Prediction:
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or
neural networks. A decision tree is a flow-chart-like tree structure, where each node
denotes a test on an attribute value, each branch represents an outcome of the test, and
tree leaves represent classes or class distributions.
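A minimal sketch of deriving such a model as a decision tree, assuming scikit-learn is installed; the tiny two-attribute data set and its labels are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, income] with a yes/no purchase label.
X = [[25, 30000], [40, 60000], [35, 52000], [22, 18000], [50, 80000], [28, 24000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# Fit a small tree; each internal node tests one attribute value.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The derived model can be printed in an IF-THEN-like rule form.
print(export_text(tree, feature_names=["age", "income"]))

# Classify a new, unseen record.
print(tree.predict([[30, 45000]]))
```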
Cluster Analysis:
Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label. The objects are grouped
based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.

Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. The analysis of outlier data is referred to as outlier
mining.
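A minimal sketch of flagging outliers by their distance from the mean, assuming NumPy is available; the values reuse the restaurant receipts that appear later in these notes, and the two-standard-deviation cut-off is an illustrative choice rather than a rule from the text:

```python
import numpy as np

# Sample values containing one observation far from the rest.
values = np.array([44.0, 50.0, 38.0, 96.0, 42.0, 47.0, 40.0, 39.0, 46.0, 50.0])

mean = values.mean()
std = values.std(ddof=1)        # sample standard deviation

# Flag points that lie more than two standard deviations from the mean.
z_scores = (values - mean) / std
outliers = values[np.abs(z_scores) > 2]

print("mean:", mean, "std:", round(std, 1))
print("outliers:", outliers)    # 96.0 is flagged
```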

Data Mining Functionalities Descriptive:


Concept/Class Description: Characterization and Discrimination. Data can be associated
with classes or concepts. For example, in an electronics store, classes of items for
sale include computers and printers, and concepts of customers include big spenders and
budget spenders.
Data characterization: Data characterization is a summarization of the general
characteristics or features of a target class of data.
Data discrimination: Data discrimination is a comparison of the general features of
target class data objects with the general features of objects from one or a set of
contrasting classes.
Supervised and unsupervised data mining techniques:
Supervised learning:
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labeled, which means some data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (the set of training examples) and produces a correct
outcome from labeled data.

Supervised learning is classified into two categories of algorithms:


• Classification: A classification problem is when the output variable is a category, such as
"red" or "blue", or "disease" and "no disease".

• Regression: A regression problem is when the output variable is a real value, such as
"dollars" or "weight".
Supervised learning deals with or learns with “labeled” data. This implies that some data
is already tagged with the correct answer.

Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
Advantages:-
• Supervised learning allows collecting data and produces data output from previous
experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
• Classifying big data can be challenging.
• Training for supervised learning needs a lot of computation time.


Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training on the data.

Unlike supervised learning, no teacher is provided, which means no training will be given to
the machine. Therefore, the machine is restricted to finding the hidden structure in unlabeled
data by itself.

It allows the model to work on its own to discover patterns and information that were
previously undetected. It mainly deals with unlabelled data.

Unsupervised learning is classified into two categories of algorithms:


• Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior (a small sketch follows
the list of clustering types below).

• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
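A minimal sketch of K-means clustering, one of the types listed above, assuming scikit-learn is installed; the two-dimensional points (for example, monthly spend and visit counts per customer) are made-up illustrative values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by two attributes: [monthly spend, visits].
X = np.array([
    [20, 2], [22, 3], [25, 2],      # a low-spend group
    [90, 10], [95, 12], [88, 11],   # a high-spend group
])

# Group the unlabeled points into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster labels:", labels)
print("cluster centers:", kmeans.cluster_centers_)
```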
Supervised vs. Unsupervised Machine Learning

• Input data: Supervised algorithms are trained using labeled data; unsupervised
algorithms are used against data that is not labeled.
• Computational complexity: Supervised learning is the simpler method; unsupervised
learning is computationally complex.
• Accuracy: Supervised learning is highly accurate; unsupervised learning is less accurate.

Major Issues in Data Mining

Data mining issues can be classified into five categories:

1. Mining Methodology

2. User Interaction

3. Efficiency and Scalability

4. Diversity of Database Types


5. Data Mining and Society

1) Mining Methodology

Mining various and new kinds of knowledge

Data mining covers a wide spectrum of data analysis and knowledge discovery
tasks, so these tasks may use the same database in different ways and require the
development of numerous data mining techniques.

Mining knowledge in multidimensional space

When searching for knowledge in large data sets, we can explore the data in
multidimensional space.

That is, we can search for interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as (exploratory)
multidimensional data mining.

Data mining—an interdisciplinary effort

The power of data mining can be substantially enhanced by integrating new


methods from multiple disciplines.

For example, to mine data with natural language text, it makes sense to fuse data
mining methods of information retrieval and natural language processing.

Handling uncertainty, noise, or incompleteness of data

Data often contain noise, errors, exceptions, or uncertainty, or may be incomplete.

Errors and noise may confuse the data mining process, leading to the derivation of
erroneous patterns.

2) User Interaction

Interactive mining

The data mining process should be highly interactive. Thus, it is important to build
flexible user interfaces and an exploratory mining environment, facilitating the
user’s interaction with the system.

Incorporation of background knowledge


Background knowledge, constraints, rules, and other information regarding the
domain under study should be incorporated into the knowledge discovery process.

Presentation and visualization of data mining results

How can a system present data mining results vividly (forming a clear image in the mind)
and flexibly, so that the discovered knowledge can be easily understood and directly used
by humans?

3) Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Data mining algorithms must be efficient and scalable in order to effectively extract
information from the huge amounts of data stored in many data repositories or in dynamic
data streams. In other words, the running time of a data mining algorithm must be
predictable, short, and acceptable by applications. Efficiency, scalability,
performance, optimization, and the ability to execute in real time are key criteria for
new mining algorithms.

Parallel, distributed, and incremental mining algorithms

The giant size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate
the development of parallel and distributed data-intensive mining algorithms.

4) Diversity of Database Types

Handling complex types of data

The challenge is how to uncover knowledge from stream, time-series, sequence,
graph, social network, and multi-relational data.

Many different attribute types, and many different kinds of data, may be present in a
database or data set that is being mined.
Mining dynamic, networked, and global data repositories

Data from multiple sources are connected by the Internet and by various kinds of
networks, forming distributed and heterogeneous global information systems. The
discovery of knowledge from different sources of structured, semi-structured, or
unstructured data is challenging. Web mining, multi-source data mining, and information
network mining have become challenging and fast-evolving data mining fields.

5) Data Mining and Society

Social impacts of data mining

With data mining penetrating our everyday lives, it is important to study the impact
of data mining on society. How can we use data mining technology to benefit our
society? How can we guard against its misuse?

Privacy-preserving data mining

Data mining will help in scientific discovery, business management, economic
recovery, and security protection (e.g., the real-time discovery of intruders and
cyber attacks). However, it poses the risk of disclosing an individual’s personal
information.

Invisible data mining: We cannot expect everyone in society to learn and master
data mining techniques. For example, when purchasing items online, users may be
unaware that the store is likely collecting data on the buying patterns of its
customers, which may be used to recommend other items for purchase in the future.

Data objects and attribute types: An attribute is a property of an object.

Attribute type: An attribute is a property of an object and represents a feature of
that object, e.g. the Name, Age, and Qualification of a Person. Attribute types
can be divided into four categories.

1) Nominal

2) Ordinal

3) Interval

4) Ratio
Nominal attribute

Nominal attributes are named attributes which can be separated into discrete
(individual) categories that do not overlap.

Nominal attribute values are also called distinct values.

Example: a colour attribute with the distinct values red, yellow, blue, green.

Ordinal attribute

In an ordinal attribute the order of the values is important and significant, but the
differences between the values are not really known.

Example

Rankings → 1st, 2nd, 3rd

Ratings → star ratings, e.g. 1 to 5 stars

We know that a 5-star rating is better than a 2-star or 3-star rating, but we don’t
know, and cannot quantify, how much better it is.

Interval attribute

An interval attribute comes in the form of a numerical value where the difference between
points is meaningful.
Example

Temperature → 10°–20°, 30°–50°, 35°–45°

Calendar dates → 15th – 22nd, 10th – 30th

We cannot find a true (absolute) zero value with interval attributes.

Ratio attribute

A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero
value.

It tells us about the order and the exact value between units of data.

Example

Age group → 10–20, 30–50, 35–45 (in years)

Mass → 20–30 kg, 10–15 kg

Because it has a true (absolute) zero, it is possible to compute ratios.

Statistical Description of Data:

• Covers numerical measures used as descriptive statistics

• Box plots (a.k.a. box-and-whisker plots) are introduced (separate vignette)

• Not all topics in the text will be covered in this vignette

• Describe data using measures of central tendency and dispersion: for a set of
individual data values, and for a set of grouped data.

• Use the computer to visually represent data.

• Use the coefficient of correlation to measure association between two quantitative
variables.


Shape – Center – Spread

• When we gather data, we want to uncover the “information” in it. One easy way to
do that is to think of: “Shape – Center – Spread”

• Shape – What is the shape of the histogram?

• Center – What is the mean or median?

• Spread – What is the range or standard deviation?

Key Terms
Mean:
Mean is the average of a dataset.
To find the mean, calculate the sum of all the data and then divide by the total
number of data.
Example
✔Find out mean for 12, 15, 11, 11, 7, 13
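Worked answer for this example:
Mean = (12 + 15 + 11 + 11 + 7 + 13) / 6 = 69 / 6 = 11.5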

Median
Median is the middle number in a dataset when the data is arranged in numerical order
(Sorted Order).
Median (Odd)

▪ Example
✓ Find out the median for 12, 15, 11, 11, 7, 13, 15

In the above example, the count of data values is 7 (odd).

First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15, 15

The middle (4th) value is the median:
7, 11, 11, 12, 13, 15, 15 → Median = 12

Median (Even)
Example:
Find out the median for 12, 15, 11, 11, 7, 13

In the above example, the count of data values is 6 (even).

First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15
Calculate the average of the two numbers in the middle:
7, 11, 11, 12, 13, 15
(11 + 12)/2 = 11.5 → Median
Mode
The mode is the number that occurs most often within a set of numbers.
Example:
For the data 12, 15, 11, 11, 7, 13, the mode is 11, because 11 occurs more often than any
other value.

Range
The range of a set of data is the difference between the largest and the smallest number
in the set.
Example
Find range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50

▪ In our example the largest number is 55 and the smallest number is 26, so the
range is 55 − 26 = 29.
Standard Deviation

▪ The Variance is defined as:


The average of the squared differences from the Mean.

To calculate the variance follow these steps:

1. Calculate the mean, x̄.

2. Write a table that subtracts the mean from each
observed value.
3. Square each of the differences, and add up this column.
4. Divide by n − 1, where n is the number of items in the
sample; this is the sample variance (for a whole population, divide by n instead).
5. To get the standard deviation, take the square root of
the variance.
Standard Deviation-example

▪ The owner of the Indian restaurant is interested in how much


people spend at the restaurant.
▪ He examines 10 randomly selected receipts for parties and writes
down the following data.
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
1. Find out Mean (1st step)
✓ Mean is 49.2
2. Write a table that subtracts the mean from each observed value (2nd step), square the
differences, and follow the remaining steps above; the result is a standard deviation of 17.

Standard deviation can be thought of as measuring how far the data values lie from the
mean: we take the mean and move one standard deviation in either direction.
The mean for this example is 49.2 and the standard deviation is 17.
Now, 49.2 − 17 = 32.2 and 49.2 + 17 = 66.2.
This means that most parties probably spend between 32.2 and 66.2.
If all data values are the same, then the variance and standard deviation are 0 (zero).
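A minimal sketch that reproduces the restaurant example above using only the Python standard library; the receipt values are exactly those given in the text:

```python
import math

# Receipt amounts from the restaurant example above.
receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

n = len(receipts)
mean = sum(receipts) / n                                  # step 1: the mean (49.2)

# Steps 2-4: squared differences from the mean, divided by n - 1 (sample variance).
variance = sum((x - mean) ** 2 for x in receipts) / (n - 1)

# Step 5: the standard deviation is the square root of the variance (about 17).
std_dev = math.sqrt(variance)

print(mean, round(std_dev, 1))                            # prints 49.2 17.0
```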
Data Visualization
Data type to be visualized
1) 1-D data, where the dimension is usually very dense.
E.g. temporal data, like time series of stock prices.
2) 2-D data.
E.g. geographical maps
3) Multi-dimensional data.
E.g. tables from relational databases
No simple mapping of attributes to the two dimensions of the screen
4) Text and hypertext, e.g. news articles
Most of the standard visualization techniques cannot be applied. In most cases, a
transformation of the data into description vectors is necessary first.
E.g. word counting, then principal component analysis.
5) Hierarchies and graphs
E.g. telephone calls
6) Algorithms and software
E.g. for debugging operations
Visualization technique
1) standard 2D/3D displays
e.g. bar charts and x-y plots.
2) geometrically transformed displays
e.g. parallel coordinates.
3) icon-based displays (glyphs)
4) dense pixel displays
5) stacked displays
Tailored to present data partitioned in a hierarchical fashion.
Embed one coordinate system inside another coordinate system.
Interaction and distortion techniques
Dynamic: changes to visualizations are made automatically.
Interactive: changes are made manually.
1) Dynamic projections
e.g. To show all interesting two-dimensional projections of a multi-dimensional dataset as a
series of scatter plots.
2) Interactive filtering
browsing: direct selection of desired subset
querying: specify properties of desired subsets
3) Interactive zooming
On higher zoom levels, more details are shown.
4) Interactive distortion
Show portions of the data with a high level of detail while others are shown with a lower
level of detail.
E.g. spherical distortion and fisheye views.
5) Interactive Linking and Brushing
– Combine different visualization methods to overcome the shortcomings of
single techniques.
– Changes to one visualization are automatically reflected in the other
visualization.
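A minimal sketch of two of the techniques above (a standard 2D x-y plot and a geometrically transformed parallel-coordinates display), assuming pandas and matplotlib are installed; the small multidimensional table is made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical multidimensional records, as if read from a relational table.
df = pd.DataFrame({
    "price": [10, 12, 30, 32, 11],
    "weight": [1.0, 1.2, 3.5, 3.3, 0.9],
    "rating": [4, 5, 2, 2, 4],
    "group": ["A", "A", "B", "B", "A"],
})

# Standard 2D display: an x-y scatter plot of two attributes.
df.plot.scatter(x="price", y="weight")

# Geometrically transformed display: parallel coordinates across all attributes.
parallel_coordinates(df, class_column="group")

plt.show()
```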
Measuring Data Similarity and Dissimilarity
– Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
– There is a separate “quality” function that measures the “goodness” of a cluster.
– The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal and ratio variables.
– Weights should be associated with different variables based on applications and data
semantics.
– It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.

Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types
Interval-valued variables
Standardize data
Calculate the mean absolute deviation (MAD) of variable f:

s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)

where m_f = (1/n) (x_1f + x_2f + … + x_nf) is the mean of variable f.

Calculate the standardized measurement (z-score):

z_if = (x_if − m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation.
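A minimal sketch of this standardization, assuming NumPy is available; the measurements reuse the small data set from the mean example earlier in these notes, treated here as one interval-scaled variable f:

```python
import numpy as np

# Measurements of one interval-scaled variable f.
x = np.array([12.0, 15.0, 11.0, 11.0, 7.0, 13.0])

mean_f = x.mean()                        # m_f
mad_f = np.mean(np.abs(x - mean_f))      # mean absolute deviation s_f

# Standardized measurements (z-scores) based on the mean absolute deviation.
z = (x - mean_f) / mad_f
print(np.round(z, 2))
```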
Similarity and Dissimilarity between Objects

◼ The Minkowski distance between objects i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) is

d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)

◼ If q = 1, d is the Manhattan distance; if q = 2, d is the Euclidean distance.

◼ Properties
◼ d(i, j) ≥ 0
◼ d(i, i) = 0
◼ d(i, j) = d(j, i)
◼ d(i, j) ≤ d(i, k) + d(k, j)
◼ One can also use a weighted distance, the parametric Pearson product-moment correlation,
or other dissimilarity measures.
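A minimal sketch of the Manhattan and Euclidean special cases of the Minkowski distance above, assuming NumPy is available; the two example objects are made-up three-attribute vectors:

```python
import numpy as np

# Two hypothetical objects described by three interval-scaled attributes.
i = np.array([1.0, 3.0, 5.0])
j = np.array([4.0, 1.0, 5.0])

def minkowski(a, b, q):
    """Minkowski distance: (sum of |a_k - b_k|^q) raised to the power 1/q."""
    return np.sum(np.abs(a - b) ** q) ** (1.0 / q)

print("Manhattan (q=1):", minkowski(i, j, 1))   # 3 + 2 + 0 = 5
print("Euclidean (q=2):", minkowski(i, j, 2))   # sqrt(9 + 4 + 0) ≈ 3.61
```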
A contingency table for binary data

For two objects i and j described by p binary variables, let q be the number of variables
that equal 1 for both objects, r the number that equal 1 for object i but 0 for object j, s the
number that equal 0 for object i but 1 for object j, and t the number that equal 0 for both.

Simple matching coefficient (invariant, if the binary variable is symmetric):

d(i, j) = (r + s) / (q + r + s + t)

Jaccard coefficient (noninvariant, if the binary variable is asymmetric):

d(i, j) = (r + s) / (q + r + s)
Dissimilarity between Binary Variables


Example

gender is a symmetric attribute


the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: Simple matching
d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables.
Method 2: Use a large number of binary variables
Create a new binary variable for each of the M nominal states.
Ordinal Variables
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled

replace x_if by its rank r_if ∈ {1, …, M_f}

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

z_if = (r_if − 1) / (M_f − 1)

Compute the dissimilarity using methods for interval-scaled variables.


Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an
exponential scale, such as Ae^(Bt) or Ae^(−Bt)
Methods:
• treat them like interval-scaled variables — not a good choice, because the scale is distorted
• apply a logarithmic transformation, y_if = log(x_if)
• treat them as continuous ordinal data and treat their rank as interval-scaled.
Variables of Mixed Types
A database may contain all the six types of variables
symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects:

d(i, j) = ( Σ_f δ_ij^(f) d_ij^(f) ) / ( Σ_f δ_ij^(f) )

where the sum runs over all p variables f, δ_ij^(f) is an indicator that is 0 if x_if or x_jf is
missing (and 1 otherwise), and the contribution d_ij^(f) of variable f depends on its type:

f is binary or nominal:
d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise.
f is interval-based: use the normalized distance.
f is ordinal or ratio-scaled:
compute the rank r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled.

You might also like