
Visual Analytics

Exam Preparation - Summer 2020

DISCLAIMER
This is NOT an official document provided by the lecturer or associates. The questions are also not based on past exams, so they may be well above or below the difficulty of the actual exam, and the wording can differ. The provided answers are not necessarily correct. I thought of these questions (and the answers) myself and use them to deepen my understanding of the topic. I just think they can be helpful to others too.

General
1. What is Visual Analytics (VA)? Why would you use it? What is the main aim of
VA?

VA is an integrated combination of data analytics and interactive visual exploration. This means the analysis process is a loop between (semi-)automatic analytics and interactive visualization. For instance, the result of the analytic process can be filtered or viewed from different perspectives with different visualization techniques.

The data sets are very large and complex. This information overload means we can get lost in data that might be irrelevant, or we process and present it in an inappropriate way.

The main aim (or core objective) is to get insight into the data (hence the focus on exploration), e.g. to understand trends, patterns, relations and new hypotheses. We want effective understanding, reasoning and decision making. Usability and performance are not the main requirements the techniques try to meet.

2. Name the core objectives for VA and give at least 1 example for each of them.

Trends: climate change, share price (German: Aktienkurs), infection rate with the corona virus

Patterns: day and night cycle in energy consumption, increased sales of Christmas-themed goods in November and December, long-term climate cycles

Relations: relation between exercise and decreased risk of cardiovascular diseases, level of education and number of children

New Hypotheses: wealth also influences the number of children, energy consumption also depends on seasons, not just day/night, the online semester will influence how well students do in the exam

3. For what kind of data is VA not useful and why? Give 2 examples and explain
why they fit into this category.

VA is not useful if the data is of high quality and the data size is moderate. This means we do not need to gain new insight and can just visualize the data without additional analytics. The insight will be obvious. An example is weather data where we only have temperature and a date. We can put this into a simple time-temperature diagram. Another example is the distribution of grades in a university course. We can simply put it in a bar plot. This is easily interpretable and does not contain erroneous data.

The other kind of data stems from well-defined problems that can be assessed by non-interactive means. We can simply use statistical methods or optimization. An example would be the prediction of visitor numbers (based on the current number of attendees) or computing how much of the company's sales are influenced by newly introduced advertisements. Statistical methods might fail if the data is not well represented by statistical measures. E.g. bimodal data is not well represented by an average or median.

Most data does not fall into those categories, so VA is needed.

4. For what data is VA useful? Give an example for each of them.

large data sets: crime statistics, census data, Amazon product data base

incomplete data: people skipping questions in surveys, sensor failures in measurements

high dimensional data: medical data, movie/music/game data bases, nutritional value of food

inconsistent data: same person registering multiple times for the same service, 2 sensors measuring the same thing but getting different results, results of a non-deterministic algorithm run with the same parameters (e.g. due to random seeding)

dynamic data: development in financial data

5. What does Scalability and Reproducibility mean in the context of VA? Why is
it important to assess techniques according to those criteria?

Scalability: First we have visual scalability. This means how well the visualization scales if we have a lot of data points. Scatterplots, for instance, can hold hundreds of data points in one diagram, but a pie chart is limited to 10-15. It is also closely related to issues such as overplotting and visual clutter. Then we have display scalability, which means how well we can use the visualization on different display sizes.
Often we need to choose techniques that can support lots of data points because VA deals mostly with large data sets.

Reproducibility: Reproducibility means we can reproduce a result. This means we have a method and want to get the same result if we use it on the same data with the same parameters. This is important so we can have undo/redo operations and comparable results. We can also see that the result is inherent to the data and not a result of chance.
For redo/undo, it is challenging to store all intermediate results and parameters, so there is often a maximal number of steps we can go back and forth. If the result is reproducible, we just have to know what interaction we did and how to revert it.
A method should produce the same result on a dataset, so we can compare the results to each other and find similarities between data sets. With non-deterministic methods, we may get very different results although one data set has a similar pattern to another.
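The point about random seeding can be illustrated with a small sketch (illustrative code, not from the lecture): fixing the seed makes an otherwise non-deterministic initialization, e.g. picking start centroids, reproducible.

```python
import random

def random_centroids(points, k, seed=None):
    """Pick k initial centroids at random; a fixed seed makes the pick reproducible."""
    rng = random.Random(seed)  # own RNG instance, isolated from global state
    return rng.sample(points, k)

points = [(0, 0), (1, 1), (2, 2), (8, 8), (9, 9)]

# Same data, same parameters, same seed -> identical initialization,
# so the downstream clustering result is reproducible and comparable.
assert random_centroids(points, 2, seed=42) == random_centroids(points, 2, seed=42)
```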

6. Sketch the VA System and explain its components in an example.

Data: the raw input; it is preprocessed (filtered, cleansed, transformed) before analysis.
Models: (semi-)automatic analysis builds models from the data, e.g. a clustering.
Visualization: data and models are presented interactively so the analyst can explore them and adjust the models.
Knowledge: the insight the analyst gains; it feeds back into the loop, e.g. by refining parameters.

Example: for sales data, we clean the raw transactions (Data), cluster the customers (Models), show the clusters in an interactive scatterplot (Visualization) and learn which customer groups exist (Knowledge).

7. If we only have raw data, we often need to perform certain operations on it.
How is this step called? What is its purpose? What do we gain from it?

This step is called filtering or preprocessing. Some techniques are sensitive to outliers, noise or incomplete data, so we need to decide how to deal with this (data cleansing). We can, for instance, choose to smooth the data, remove outliers or interpolate missing data.
In this step we can also do analytical computations to obtain aggregated data (e.g. summary statistics), generate derived data like rate of change or create a hierarchy (e.g. clustering).

8. Discuss flexibility and guidance with respect to VA.

For VA, we have a lot of tools at hand. Each of those tools has different parameters and parameter combinations and can be combined with others or used alone. This gives us great flexibility in choosing the right tools for the data at hand. But it also means the analyst needs a lot of background knowledge about how those tools work and how a parameter influences the result. Especially for high dimensional data and large data sets, it is not obvious what to choose.
We can therefore guide the analyst to help them choose a tool or fitting parameter combination. This can be done by automatically computing statistical scores of the data, giving an overview of possible techniques, using estimated parameters based on data size or giving an overview of the data.
Flexibility is very important, as there are no techniques that perform well for every data set and we need to give the opportunity to compare different techniques. Flexibility also gives us the opportunity to interact with the data. Guidance often gives a good starting point and should not be too intrusive.

9. What is visual analytics? Comment on related subdisciplines and how they interact with each other. Use terms such as "hypothesis", "finding", "confirmation". What is the role of the analyst?

10. Describe the Rank-by-feature framework by (Seo, 2004) 1. Use the term
„interestingness“ measure. What are interestingness measures for individual
dimensions and pairs of dimensions?

The Rank-by-feature framework provides an overview of correlations, e.g. with a scatterplot matrix. It can overlay statistical measures and rank the correlations, e.g. by certain types of regression or statistical power. One can explore relations: select a type of relation, present the permutation matrix, provide a list of related attribute pairs and show the selected pair in the scatterplot. It is an example of metrics-based analysis.
A major drawback of rankings is that the margin between two items is not conveyed. Are the Top-N almost equal or is there a "boundary" where the items above have much higher scores? Usually, we want diversity in our solutions to gain more insight.

The ranking is based on some interestingness measures. (In the whole paper the word "interestingness" is only present once. The ranking criteria they use are not called interestingness, so I don't know if he wants the measures we learned about in the class or the ones from the paper. I will list the ones from the paper.)
individual dimensions: normality of the distribution, uniformity of the distribution, number of potential outliers, number of unique values, size of the biggest gap
pairs of dimensions: correlation coefficient, least square error for curvilinear regression, quadracity, number of potential outliers, number of items in the region of interest, uniformity of scatterplots

1
https://www.cs.umd.edu/hcil/hce/presentations/seo_shneiderman_rff_ivs.pdf

11. Imagine that there are a couple of visual analytics systems, e.g. in finance or
for journalists. How can these systems be compared with each other? In other
words: How can we systematically evaluate visual analytics systems?

(I couldn't really find anything in the slides for this. So I'll just write what I think is fitting. If anyone can point out the slide from the lecture, feel free to edit.)
VA systems should primarily be judged by how well they produce insight and hypotheses, not by usability or performance. A VA system should be able to use different techniques and combinations of those to give flexibility, but it should also be able to give guidance.

Prof. Preim also said we can compare them on the basis of the questions asked, e.g. what do we want to find out.

Global Clustering
12. What is Clustering? Why would we need clustering?

Clustering is part of exploratory data analysis. A cluster is a group of items that are more similar to each other than they are to members of other clusters (the intra-cluster coherence is maximized). Similarity is defined by a metric (e.g. Euclidean distance). A cluster summarizes a homogeneous region in the data. Clustering is then the process of grouping itself. It can be performed automatically.

Clustering helps us understand the structure of the data. We can use it to partition data and as a preprocess for classification or focus-and-context visualizations.

13. What is "global" clustering? When does it start to fail?

Global means we cluster the data in every dimension. So if we have a 30-dimensional data set, we would look for objects that are similar in all 30 dimensions. Global clustering begins to fail at around 10-15 dimensions already, because the higher the dimensionality, the less likely it is that a multitude of objects are similar in all dimensions. This is why we use subspace clustering.

14. How would we decide what clustering method to use?

When the distribution of the data is unknown, we use clustering with different methods and parameters until results are plausible. Selection of a clustering method is based on assumptions about the data and a clustering model. The clustering model can be chosen based on the distribution of data, the expected shapes of clusters and their relations.

15. What different kinds of clustering can we perform? Explain them in about 1
sentence.

hierarchical or non-hierarchical: Clusters can be ordered into a hierarchy, so we have clusters containing clusters. The hierarchy can be built bottom-up or top-down.

fuzzy or binary: A fuzzy clustering only gives us a percentage of how likely it is that an item belongs to a certain cluster, while binary clustering partitions the data into non-overlapping parts (hard clusters).

deterministic or non-deterministic: Deterministic clustering always yields the same result if applied to the same data with the same parameters.

with various distance measures: We can define similarity differently by choosing another distance measure (e.g. Euclidean, city block, ...).

16. What are the results of a clustering?

hard or fuzzy clusters, cluster representatives, hulls, regression lines

17. What is the basic concept of fuzzy clustering?

A fuzzy clustering only gives us a percentage of how likely it is that an item belongs to a certain cluster, while binary clustering partitions the data into non-overlapping parts (hard clusters).

18. What are the requirements for a clustering method?

An ideal clustering method is scalable to many objects and dimensions, can deal with clusters of arbitrary shape, is robust against noise and outliers and creates plausible results.

19. What are the different clustering methods/paradigms? Explain their general
idea in 1-2 sentences and name an example for each of them.

distance model: Objects belong to a cluster if they are closer to cluster center i than they are to a cluster center j. Example: k-means

density model: Objects belong to a cluster if their local density is higher compared to the average density. Examples: DB-SCAN, OPTICS

hierarchy model: Clusters are assumed to exist at different levels. (We had no named example, we only divided them into top-down (divisive) and bottom-up (agglomerative).)

20. K-means clustering


a. What is the model k-means is based on?

k-means is a clustering method based on a distance model.

b. Explain the idea of k-means.

k-means partitions the data set into k groups, which means we have to approximate the number of clusters beforehand. It tries to find k centroids, one for each cluster, which minimize the distance of the associated data points. Mathematically speaking, we want to minimize

J = Σ_{i=1..k} Σ_{x_j ∈ S_i} ||x_j − μ_i||²

where S_i is cluster i, x_j is the j-th element of cluster i and μ_i is the centroid of cluster i.

The boundaries of the clusters correspond to the Voronoi decomposition of the cluster centroids.
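The two alternating steps can be sketched in a few lines of pure Python (illustrative; initial centroids are fixed here instead of random, which also makes the sketch deterministic):

```python
import math

def kmeans(points, centroids, iterations=50):
    """Minimal k-means sketch: alternate assignment and update, which locally
    minimizes J = sum over clusters of squared distances to the centroid."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups in 2D.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (9, 9)])
# The centroids converge to roughly (0.33, 0.33) and (9.33, 9.33).
```

Real k-means picks the initial centroids randomly, which is exactly what makes it non-deterministic; k-means++ biases that random choice instead of removing it.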

c. Does k-means fulfill the requirements we have for a clustering method?


Explain your answer.

scalable: k-means can be implemented very efficiently and can therefore also be used with large datasets and high dimensions 2

arbitrary shape: One of k-means' biggest drawbacks is that it prefers spherical and similarly sized clusters. It also does not support split or overlapping clusters.

robustness: k-means is not robust against outliers. It always assigns potential outliers to a cluster. Due to random initialization, it is also non-deterministic and can converge to local minima.

plausible results: Due to not handling arbitrary shapes, results can be unexpected 3

2
https://stats.stackexchange.com/a/183213
3
https://en.wikipedia.org/wiki/K-means_clustering#Discussion

d. What are the parameters we can adjust for k-means?

k: determines the cluster number. Can be estimated with the elbow method or the silhouette method. If k is too high, clusters get separated that should belong together; if it is too small, we get wrong assignments.

distance metric: Euclidean distance can be used for numerical data, but we could also use city block distance or something else.

initial centroids: In extensions of the k-means algorithm, a new centroid may be chosen with higher probability if its distance to the other centroids is large (k-means++) 4

e. Which of the following diagrams is a valid k-means clustering (to the best of your judgement)?

(the two diagrams are not reproduced here; the answers were: yes / no)

21. DB-SCAN (density based spatial clustering of applications with noise)


a. What is the model DB-SCAN is based on?

DB-SCAN is a density-based method.

b. Explain how the DB-SCAN algorithm works?

DB-SCAN bases clusters on the local density of an object's surroundings. Two objects are density-connected if there exists a chain of dense objects between them. An object is dense if there is another object at most a distance of ε away. If at least m points are density-connected, they form a cluster. The other points are labeled outliers.

4
https://en.wikipedia.org/wiki/K-means%2B%2B
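The cluster-growing idea can be sketched as follows (a simplified illustration of the algorithm, with a naive neighbourhood query):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DB-SCAN sketch: grow clusters from core objects (points with at
    least min_pts neighbours within eps); points in no dense region get label -1."""
    labels = {p: None for p in points}

    def neighbours(p):                        # naive O(n) query -> O(n^2) overall
        return [q for q in points if math.dist(p, q) <= eps]

    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:               # not a core object: noise for now
            labels[p] = -1
            continue
        labels[p] = cluster_id                # start a new cluster from p
        seeds = list(nbrs)
        while seeds:                          # expand the density-connected region
            q = seeds.pop()
            if labels[q] == -1:               # border point: attach, don't expand
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbours(q)) >= min_pts:  # q is a core object: expand through it
                seeds.extend(neighbours(q))
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
labels = dbscan(pts, eps=1.5, min_pts=3)
# The two dense 2x2 blocks become two clusters; (20, 20) stays noise (-1).
```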

c. Does DB-SCAN fulfill the requirements we have for a clustering algorithm? Explain your answer.

scalability: DB-SCAN has a worst-case complexity of O(n²) and is slower than k-means. If we use the Euclidean distance, it scales well with the dimension.

arbitrary shape: DB-SCAN supports clusters of any shape.

robustness: DB-SCAN depends heavily on the parameter choices but is robust against noise and supports finding outliers.

plausible results: DB-SCAN gives plausible results. Because we don't have to specify the number of clusters beforehand, it can find the optimal number of clusters for the parameters automatically. We have to consider density differences though.

d. What are the parameters we can adjust for DB-SCAN?

minimal number of points m: If this is too small, we potentially get a lot of small clusters of points that should be outliers; if it is too high, we don't obtain clusters that should be viewed as clusters.

maximal distance ε: If too small, we only get clusters in very dense areas; if too high, clusters get very big.

e. In the following diagram, are A and B, A and D, and A and C density-connected if the circles represent the maximal distance ε? (diagram not reproduced here)

A-B yes, A-C yes, A-D no

f. DB-SCAN can handle clusters with homogeneous density much better than clusters of different density. What method do you know that solves this problem?

OPTICS

22. Hierarchical Clustering

a. What is divisive and agglomerative clustering?

divisive: top-down approach, subdivide clusters into smaller clusters

agglomerative: bottom-up approach, merge clusters into bigger clusters

b. How does the linkage criterion influence cluster shape for agglomerative hierarchical clustering (AHC)?

We have different linkage criteria that define when 2 clusters are merged. Assume we have 2 clusters A and B.

single linkage: Select the pair of points (one from A, one from B) with minimum distance to each other; merge the clusters where this distance is minimal. Sensitive to outliers, may result in long and thin clusters, supports arbitrary shapes, cannot separate clusters properly if there is noise between them.
min{d(a, b) : a ∈ A, b ∈ B}

complete linkage: Select the pair of points (one from A, one from B) with maximal distance to each other; merge the 2 clusters where this is minimal. Prefers spherical shapes, tends to split large clusters.
max{d(a, b) : a ∈ A, b ∈ B}

average linkage: Merge the two clusters where the average distance over all pairs (between those clusters) is minimal.
(1 / (|A|·|B|)) · Σ_{a∈A} Σ_{b∈B} d(a, b)

centroid linkage: Compute the centroids of A and B; merge the 2 clusters where the distance of the centroids is minimal.
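The three pairwise criteria can be written down directly (illustrative sketch; `math.dist` is the Euclidean distance):

```python
import math

def single_linkage(A, B):
    """Minimum distance over all pairs (a from A, b from B)."""
    return min(math.dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Maximum distance over all pairs."""
    return max(math.dist(a, b) for a in A for b in B)

def average_linkage(A, B):
    """Mean distance over all |A|*|B| pairs."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0, 0), (0, 2)]
B = [(4, 0), (4, 2)]
# Single linkage only sees the closest pair, complete linkage the farthest one.
assert single_linkage(A, B) == 4.0
assert abs(complete_linkage(A, B) - 20 ** 0.5) < 1e-9
```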

c. How does AHC perform based on our requirements for clustering methods?

scalability: The naive approach is O(n³), so it does not scale well; enhanced methods are better with O(n²).

arbitrary shape: Depends on the linkage criterion; complete linkage prefers spherical shapes.

robustness: Depends on the linkage criterion; single linkage is not as robust to outliers.

plausible result: Results are plausible; AHC is used most commonly 5

d. Given the following Venn diagram as the result of a hierarchical clustering algorithm, draw the corresponding dendrogram.

e. What is hierarchical clustering good for?

Hierarchical clustering can, for instance, be used to determine an appropriate cluster number for algorithms like k-means. It reveals similarity relations between clusters and provides a level-of-detail extraction for clusters.

5
https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/

23. Mixed topics - Clustering
a. Most clustering algorithms need some kind of distance/similarity metric.
Give some examples how such a metric can be designed? Consider
that the individual dimensions may strongly differ in their range (in case
of scalar values) and in their data type, e.g. data may be categorical or
numerical.

We can define a distance metric for categorical data as well. E.g. we can say the distance is 0 if objects have the same category and 1 if they have different categories. For scalar values with strongly differing ranges, we can normalize each dimension, e.g. to [0, 1], before combining the dimensions.
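One way to combine both ideas is a Gower-style distance (a sketch under my own assumptions, not from the lecture: the `types` and `ranges` parameters are illustrative): categorical dimensions contribute 0/1, numerical ones a range-normalized difference, so no single dimension dominates.

```python
def mixed_distance(x, y, types, ranges):
    """Gower-style distance sketch for mixed categorical/numerical data."""
    total = 0.0
    for xi, yi, t, r in zip(x, y, types, ranges):
        if t == "cat":
            total += 0.0 if xi == yi else 1.0   # same category -> 0, else 1
        else:
            total += abs(xi - yi) / r           # numerical: normalize by range
    return total / len(x)

# Age in years (range 0..100) and blood type (categorical).
d = mixed_distance((30, "A"), (50, "B"),
                   types=("num", "cat"), ranges=(100, None))
# d = (20/100 + 1) / 2 = 0.6
```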

b. What can we do to improve the clustering result?

We can cluster with certain additional constraints. We can specify which items should never (cannot-link) or always (must-link) be in the same cluster. Not all criteria can be fulfilled at the same time; the degree to which the criteria are fulfilled should be an evaluation criterion. Constrained clustering is then a (semi-)supervised method instead of an unsupervised one.

c. What is special when we cluster time dependent data?

For time-dependent data, the visualization should reflect how clusters emerge, move, merge, split or disappear. Next to the actual clusters, we also want to know how clusters change over time. We can adjust the distance metric to be independent of time.

d. Why is it generally a good idea to show more than one cluster result?

Clustering results depend on multiple parts (the algorithm, parameters and initialization). Remember: VA aims to generate insight, so we can compare different clustering results that have high quality but are dissimilar. This gives us a different view on the data.

e. What is multi-clustering?

Computation and visualization of a set of clustering results (alternative clusterings). This is done because clustering results depend on the algorithm, the parameters and, in case of stochastic algorithms, on the initialization. This can yield very different plausible results. Often there is no single right solution for clustering, so we want to compare the results.

24. Name 3 categories in which clusters can differ.

count, size, density, clumpiness, number of outliers, shape, centroid position

25. Imagine we have a very large dataset with 10 dimensions we have little to no information about. What clustering algorithm would you suggest and why? What preprocessing steps do you need to perform for your choice?

(This is a very open question that relies on the arguments given.)

I would use OPTICS for clustering, because for k-means we need to choose k but we have no information about the data set. We also don't know or cannot guess whether the clusters have a roughly spherical shape or are noise-free, so a clustering method that can deal with any shape and with noise is better. OPTICS is also a little better than DB-SCAN as it can deal with different densities. For OPTICS (like DB-SCAN) we need to estimate the minimum number of points and the maximum distance. This we would need to do in preprocessing.

Another answer may choose k-means because we have a large data set and high dimensions. k-means is one of the fastest methods. Noise and outliers can be dealt with in preprocessing, and k can be found in different ways.

26. What interactions can we perform on clusters?


We can color code clusters, add/remove items, select clusters, adjust parameters,
set constraints, zoom/pan the visualization, …

27. Some algorithms require an a priori estimation of the expected number of clusters. How can the user be assisted in defining this number?

Elbow method
"Using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data." 6
Calculate the Within-Cluster Sum of Squared errors (WSS) for different values of k, and choose the k for which WSS first starts to diminish. In the plot of WSS versus k, this is visible as an elbow. 7

6
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
7
https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb

Silhouette method
Calculate the silhouette coefficient for different numbers of clusters k and choose the k where it peaks.

Hierarchical Clustering
If we cluster hierarchically, we can step through the tree and visually decide what depth is appropriate for the data set.
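The WSS quantity behind the elbow method can be sketched directly (illustrative code): compute it for each candidate partition and look for the k after which the decrease flattens.

```python
def wss(clusters):
    """Within-Cluster Sum of Squared errors: squared distance of every point
    to its cluster mean, summed over all clusters."""
    total = 0.0
    for cl in clusters:
        mean = tuple(sum(c) / len(cl) for c in zip(*cl))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mean)) for p in cl)
    return total

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
wss_k1 = wss([points])                                # k = 1: one big cluster
wss_k2 = wss([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])  # k = 2: the natural groups
# Going from k=1 to k=2 drops WSS from 101.0 to 1.0; splitting further would
# barely help, so the elbow suggests k = 2.
```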

28. Given the following 2D datasets, which clustering method would you use to cluster the data? Explain your answer.
(This also depends on the explanation, although I would say it is often pretty obvious which will give the best result.)

a.
I would use OPTICS because we have 3 bars of different density. For DB-SCAN it will likely be difficult to find a good global parameter, and we either get one big cluster or only the middle one while the rest are outliers. k-means might also work, but only if the centroids are very well chosen.
(k-means and DB-SCAN result plots not reproduced here)

b.
I would use k-means with a parameter of k=7 because it looks like we have about 7 circular clusters. DB-SCAN would probably assign everything to one big cluster or produce several small clusters, as the density varies slightly, e.g. in the middle.
(k-means and DB-SCAN result plots not reproduced here; k-means may not converge to this solution though)
c.
I would use DB-SCAN because it looks like we would obtain 4 clusters of higher density and the rest of the points are outliers (noise). k-means cannot detect the ring cluster, because it does not allow non-convex / split clusters.
(k-means and DB-SCAN result plots not reproduced here)

Subspace Clustering
29. Why would we do subspace clustering instead of global clustering?

Global clustering is limited to about 10-15 dimensions. The higher the dimensionality, the sparser the data becomes, and sparse data has no or hardly any clusters. The data is only similar in some dimensions; these dimensions form a subspace. Also, noise, irrelevant data and highly correlated dimensions reduce the quality of global clustering.

30. What is the difference between subspace search and subspace clustering? Is
there an advantage of one over the other? Explain your answer.
(I think this is the same as the question "Subspace clustering can be realized in an integrated manner or in a decoupled manner. Explain these two approaches." from the example questions.)

Subspace search only searches for eligible subspaces; it does not perform clustering in those subspaces. Subspace clustering does both in one algorithm. Subspace search has an advantage: it is more flexible, less biased (as clustering often works with assumptions) and more effective, since uninteresting subspaces are filtered before the clustering step.

31. What is subspace clustering? What is a clusterable subspace?

Subspace clustering is the combination of subspace search and clustering. Subspace search is the search for low-dimensional representations of high dimensional data that are useful for grouping. So we want to find a part of the dataset that shows some potential for finding groups. E.g. in medical data, we usually have a lot of information that might not be useful for grouping. To identify risk groups for a certain disease that does not show in blood values, we can disregard all blood value tests we might have acquired, so we only use a subspace of all those values.
The clustering part is the same as for global clustering: we partition the subspace into groups of similar items.

A subspace is not clusterable if the points have the same distance to all of their neighbours. It is clusterable if there are differences in density. (I think SURFING used this as a measure.)

32. What preprocessing steps might we need to perform before attempting a


subspace search/clustering?

We need to perform preprocessing before we attempt a subspace clustering. We need to either remove or interpolate incomplete data, normalize the data to a range of 0 to 1 and deal with categorical data. Categorical data can either be transformed into numerical data (as a lot of algorithms do not deal with it separately), pruned, or handled by an algorithm that supports it.
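The 0-to-1 normalization mentioned above can be done per dimension with min-max scaling (a small sketch):

```python
def min_max_normalize(column):
    """Rescale one numerical dimension to [0, 1] so that dimensions with large
    value ranges do not dominate the distance computation."""
    lo, hi = min(column), max(column)
    if hi == lo:                    # constant dimension: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

ages = [20, 35, 50, 80]
assert min_max_normalize(ages) == [0.0, 0.25, 0.5, 1.0]
```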

33. What heuristics can we use to prune the search? Why is this useful?

Subspace search grows large: we have 2^n − 1 possible axis-aligned subspaces, with n being the number of dimensions. If we consider arbitrarily oriented subspaces, the number is infinite. It is therefore infeasible to use a brute-force approach. As heuristics, we can prefer subspaces with high variance, limit ourselves to axis-aligned subspaces, prefer lower-dimensional subspaces or give an expectation for the cluster number. We also want to reduce redundancy in our clusters.

34. Name and describe a subspace search algorithm.

RIS (Ranking Interesting Subspaces) 8

o is a core object if it has at least minPoints objects in its ε-neighbourhood:
|N_ε(o)| ≥ minPoints

A dense region is formed by a core object and all objects that are density-connected to it. We count all objects in dense regions of the subspace S: COUNT(S).

The interestingness is then the number of objects in dense regions divided by the volume of the subspace:
Interestingness(S) = COUNT(S) / VOLUME(S)

We can then rank the subspaces according to this interestingness measure. Higher interestingness means higher relevance.

The parameters minPoints and ε can be chosen according to a heuristic:
minPoints = ln(n), with n the number of points in the subspace
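The RIS quantities can be sketched as follows (illustrative code; the dense-region counting is simplified to finding core objects rather than full density-connected regions):

```python
import math

def core_objects(points, eps, min_points):
    """Objects with at least min_points neighbours (themselves included)
    within distance eps, i.e. |N_eps(o)| >= minPoints."""
    return [p for p in points
            if sum(math.dist(p, q) <= eps for q in points) >= min_points]

def interestingness(count_s, volume_s):
    """RIS score of a subspace S: objects in dense regions per unit volume."""
    return count_s / volume_s

# Four mutually close points and one isolated point in a 2D subspace.
points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (9, 9)]
cores = core_objects(points, eps=1.0, min_points=4)
# Only the four close points are core objects; (9, 9) is not.
assert len(cores) == 4
assert interestingness(count_s=6, volume_s=10) == 0.6
```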

SURFING (Subspaces Relevant for Clustering) 9

Analyses a histogram of k-nearest-neighbour distances. Subspaces with non-uniform distributions are more interesting → interestingness. If we used the whole dataset, we would get performance problems really fast, so the algorithm only uses around 5% of all elements. This is a bottom-up approach: a subspace gets extended if it is more interesting in higher dimensions (so we add dimensions and do not keep redundant subspaces of subspaces). This does not assume a specific clustering structure and is applicable to a wide range of numbers of dimensions. As we perform k-NN, we need to give the algorithm a k or a range of ks. It is stable for k between 5 and 20. High-ranked subspaces tend to be redundant with this algorithm.

35. RIS
The following picture represents a 2D subspace. (diagram not reproduced here)

a. Name all core objects when the circles represent the ε-neighbourhood of the objects and minPoints = 5. How many core objects would we have if minPoints = 4?

Object 5 is a core object for minPoints = 5. We would have objects 5 and 13 as core objects if minPoints = 4.

8
https://www.dbs.ifi.lmu.de/Publikationen/Papers/PKDD03-RIS-final.pdf
9
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.5588&rep=rep1&type=pdf

b. How many objects contribute to a dense region? (Count every object only once if there are overlapping dense regions.)

COUNT(S) = 6 (5 points + 1 core object)

c. Compute the interestingness if the volume is VOLUME(S) = 10.

Interestingness(S) = COUNT(S) / VOLUME(S) = 6/10 = 0.6

36. What are the paradigms we can divide subsearch clustering into? Name an
example for an algorithm for each of them. Explain one in detail.

cell-based: Divide the subspace into cells; if the number of items in a cell is above a threshold → cluster.
CLIQUE (CLustering In QUEst)
Bottom-up algorithm. Clusters represent connected components in a graph where the nodes represent dense units. Can generate arbitrarily shaped clusters; heavily depends on grid resolution and threshold.

density-based: The local density is higher than the average density.
SUBCLU
Slower than CLIQUE but produces slightly better results; heavily depends on the global density threshold; biased towards lower dimensions.

clustering-based: Steered by global parameters related to an expected result, e.g. the number of clusters.
PROCLUS (projected clustering)
Based on the number of clusters C and the average dimensionality D → similar to k-means. Start with C random medoids and try to fit a subspace to them; they get refined until the results do not improve anymore.
PROCLUS prefers large clusters, is efficient, simple and robust against noise.

37. CLIQUE
Which cells would belong to a cluster if minPoints = 5?

A1, C1, C2

38. How can we deal with categorical data in subspace clustering?

We can transform categorical data into numerical data or use frequent item mining. We need a special normalization for categorical, continuous and hybrid data. Some methods for categorical data use a density and frequency estimation.

39. How can we evaluate the cluster quality?

To evaluate the clusters, we can use an artificially created data set to test whether algorithms return the known clusters, but we also need to test real-world data as a benchmark (as real-world data is often not as well behaved as artificially created data). The artificial data should also be of similar size and dimension as the real-world data.
We can also evaluate the clusters based on cluster evaluation criteria like purity. For real-world data, it might be useful to get feedback from experts regarding plausibility. A cluster should represent a concept and have relevance for decision making.

We can also check how the clusters change if we change parameters of the algorithm.

40. What are the limitations of subspace clustering?

Subspace clustering can deal with high dimensions, but more than about 100 dimensions are too much information to explore efficiently. We can push this limit by including background information about the data in our subspace search (constrained clustering). This is more scalable and improves accuracy. Clustering algorithms only identify dense areas; their primary goal is not to identify correlated dimensions, patterns or outliers.

Visualization
41. What needs to be visualized to convey the results of subspace clusters? Which information would you consider important for an overview, and which information could be revealed on demand as interesting details?

For subspace clusters, we need to know which dimensions contribute to the subspace, how many clusters we have, how they are distributed (and/or separated) and how much the subspaces overlap.

For an overview, it is enough to see a visual representation of the subspaces, their relative size and their overlap. This gives us enough information to spot redundant subspaces and to see which are the most important ones.
In a detailed view, we should be able to see the contributing dimensions, how clusters are distributed in the subspace and cluster properties such as quality metrics, size, count, shape, etc. We should also be able to see outliers if they exist. With even more detail, we should be able to select individual clusters to see the distribution of data in the individual dimensions, or even select individual data points.

42. Why do we need a visualization for subspace clustering especially? What are
some challenges that arise in subspace clustering visualization?

For partially overlapping subspace clusters, we are not able to apply established automatic quality assessments. So visualization is the main quality assessment. We have to consider even more data and perspectives than for global clustering.

The main challenges we face are overlapping subspaces and dimensions of subspaces, as well as the membership of objects to certain subspaces.

43. What criteria can we use to assess subspace clustering results?

In the visualization, we want to see non-redundancy (clusters should not have too many overlapping dimensions or too many overlapping instances), coverage (are most instances and dimensions part of a cluster?) and cluster characteristics (size, compactness, dimensions involved).

44. What needs to be displayed in a subspace clustering visualization?

We need to display properties of individual clusters, compare clusters and see the cluster quality. Cluster properties include the number of objects, the dimensions, which dimensions were removed, and the distribution in each dimension. A comparison between clusters can take place with respect to differences and overlaps in the aforementioned properties. Cluster quality can be assessed in different ways (see evaluation for examples).

45. Name techniques to visualize clusters.

parallel coordinates, scatter plot, heat map, linked views, ClustNails

46. Name one advantage and one disadvantage (each) for visualizing subspace
clusters with a heat map and parallel coordinates.

heatmap: A heatmap represents the similarity of the detected subspaces (w.r.t. overlapping dimensions). (This refers to slide 53 of the subspace clustering lecture. I have no idea if this is what they meant.)
The paper [10] says it is the similarity between the selection of subspace clusters in the MDS projection. Each row and each column represent one of the selected clusters. Each cell, as a combination of two clusters, represents the similarity or dissimilarity between the two clusters by means of color.
Advantage: It is easy to see how many and which subspace (clusters) are similar to each other.
Disadvantage: It is not possible to see which dimensions contribute to a subspace or what the clusters in them look like.

[10] https://scibib.dbvis.de/uploadedFiles/Hundetal2016Visualanalyticsforconceptexplorationinsubspa.pdf

From the paper mentioned at heatmaps: "The MDS projection, however, can distort the perception of similarities as in many scenarios there is no optimal 2D representation of all pair-wise similarities, which results in perceivable patterns which are not given in the underlying data."

parallel coordinates: clusters of subspaces can be shown as colored lines in a parallel coordinate view.
Advantage: contributing dimensions and members of a cluster are well visible.
Disadvantage: parallel coordinates can quickly look cluttered. The visuals depend on the ordering of the dimensions.

47. "ClustNails"


a. Explain how the visualization technique works.

For a subspace, each dimension has a weight based on variance. The contributing dimensions are then visualized in a radial layout for easy comparison. The weight determines the length of the nail.

b. What does it visualize?

It visualizes how much a dimension contributes to a subspace.

c. Given a 5D dataset, we have the following weights: dim1: 10%, dim2: 20%, dim3: 0%, dim4: 40%, dim5: 10%. Sketch a ClustNails representation for this subspace.
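The nail geometry for such a sketch can be computed like this. This is only a minimal sketch: the exact radial layout, scaling and rendering used by ClustNails are assumptions here.

```python
import math

# Hedged sketch: one "nail" per dimension on a radial layout; the nail length
# is proportional to the dimension's weight (layout details assumed).
def nail_endpoints(weights, radius=100.0):
    n = len(weights)
    out = []
    for k, w in enumerate(weights):
        angle = 2 * math.pi * k / n      # evenly spaced directions
        length = radius * w              # weight in [0, 1] scales the nail
        out.append((length * math.cos(angle), length * math.sin(angle)))
    return out

# weights of question 47c: dim1..dim5
ends = nail_endpoints([0.10, 0.20, 0.00, 0.40, 0.10])
```

In the resulting sketch, dim4 gets the longest nail, dim3 degenerates to a point at the center.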

48. How can we visualize the relations between subspaces?

We can visualize the relations between subspaces with ​trees and graphs,​ but those
do not scale well to many subspaces. They may, however, be useful for zoomed in
portions (detailed views).

Another way is dimension glyphs. We project the subspace to a 2D scatter plot view (see dimension reduction techniques) and show above the view which dimensions are involved in this subspace.

49. Describe Subspace Clustering Visualization technique x ∈ {ClustNails, SubVis, OpenSubspace}. Consider also the interactive exploration of the visualization.
(I find this to be a really stupid question. OpenSubspace is literally just a screenshot and googling brings you to a non-existent website.)

50. What interactions can we perform on a visualization?

We can have a lasso selection, view detailed information about involved dimensions (after selection), store relevant results, and brush to other (similar) subspaces. An experienced user can add/remove dimensions.

Cluster Validation
51. By what means can we rate the quality of a cluster? What failure cases can
arise for clusters?
(Alternative formulation: "How can clustering results be evaluated? Consider
qualitative and quantitative aspects.")

A cluster can contain false positives (objects assigned to the cluster that should not be part of it) and false negatives (objects not in the cluster that should be). False positives render the cluster less meaningful as they do not contribute to the concept the cluster should represent. False negatives mean that an essential structure was not detected.

In addition to a global assessment, individual clusters are analyzed w.r.t. specific boundaries. For a global score, we can use a weighted combination of individual cluster scores. If available, we can compare the clustering result to a gold standard (eg manual clustering by experts; a qualitative aspect). If this is not available, we can fall back to purity measures (a quantitative aspect). Another qualitative aspect can be the visualization itself: the user can look at a visualization and try to assess whether the clustering makes sense.

52. Name 3 quality measures. Explain one in detail.

Silhouette coefficient
We have a cluster A and several clusters B_i, with objects a ∈ A and b ∈ B_i. We then compute the average distance of o ∈ A to the other objects in A,

  d_a = d(o, A) = 1/(|A| − 1) · Σ_{a ∈ A} d(o, a),

and the minimum average distance to every other cluster B_i,

  d_b = min{ d(o, B_i) = 1/|B_i| · Σ_{b ∈ B_i} d(o, b) | B_i ≠ A }.

This means that if the object o lies well inside the cluster A, d_a will be smaller than d_b.

The silhouette S(o) of object o is then

  S(o) = (d_b − d_a) / max{d_b, d_a} ∈ [−1, 1].

A negative value means o is closer to another cluster than to A, which indicates it might be classified poorly.

The coefficient for the whole clustering is then

  sc = 1/n_C · Σ_{C ∈ clustering} Σ_{o ∈ C} S(o),

where n_C is the number of all objects in all clusters.
(The slides are a bit confusing but are supposed to match the definition on Wikipedia: https://en.wikipedia.org/wiki/Silhouette_(clustering))
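The definition above can be turned into a small pure-Python function working directly on a distance matrix; the matrix from question 53 serves as input (the function and variable names are my own):

```python
# Sketch: silhouette coefficient from a distance matrix and cluster labels,
# following the definition above (assumes every cluster has >= 2 objects).
def silhouette(dist, labels):
    n = len(labels)
    clusters = set(labels)
    scores = []
    for o in range(n):
        own = [j for j in range(n) if labels[j] == labels[o] and j != o]
        d_a = sum(dist[o][j] for j in own) / len(own)          # avg. within own cluster
        d_b = min(                                             # min avg. to other clusters
            sum(dist[o][j] for j in range(n) if labels[j] == c) /
            sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[o])
        scores.append((d_b - d_a) / max(d_a, d_b))
    return sum(scores) / n

# distance matrix of question 53 (clusters A = {a1,a2,a3}, B = {b1,b2,b3})
D = [[0, 1, 2, 3, 6, 5],
     [1, 0, 3, 4, 5, 8],
     [2, 3, 0, 6, 7, 6],
     [3, 4, 6, 0, 2, 1],
     [6, 5, 7, 2, 0, 1],
     [5, 8, 6, 1, 1, 0]]
sc = silhouette(D, ["A", "A", "A", "B", "B", "B"])
```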

centroid-based measure
For this method we compute the centroids of a cluster and compare the distance of
the cluster elements to the centroid of their own cluster to the distance to the centroid
of all other clusters. A centroid is the average of the positions of points in a cluster.
So c_i is the centroid of cluster i, and p_i ∈ C_i is a point in cluster i. If dist(p_i, c_i) < dist(p_i, c_j) holds for all points p_i and all other clusters j, the clusters are perfectly separated. For a clustering obtained by k-means, this should automatically be the case.
The portion of points for which this inequality does not hold indicates the cluster quality. To obtain a measure for the whole clustering result, we take the sum weighted by cluster size.
A downside of this measure is that split or interwoven clusters give low values. Narrow and curved clusters also produce low values, while convex, compact shapes give high values. The method is relatively robust against different sizes and densities of clusters.
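A minimal sketch of this measure (toy 2D clusters; the helper names are my own):

```python
# Sketch of the centroid-based measure: fraction of points closer to their
# own cluster's centroid than to any other centroid (toy 2D data).
def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def centroid_score(clusters):
    cents = [centroid(c) for c in clusters]
    ok = total = 0
    for i, cluster in enumerate(clusters):
        for p in cluster:
            d_own = (p[0] - cents[i][0]) ** 2 + (p[1] - cents[i][1]) ** 2
            if all(d_own < (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                   for j, c in enumerate(cents) if j != i):
                ok += 1
            total += 1
    return ok / total

# two well-separated toy clusters -> every point passes the test
score = centroid_score([[(0, 0), (1, 0), (0, 1)], [(5, 5), (6, 5), (5, 6)]])
```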

grid based measure
Overlay the domain with a grid and count how many cells contain points of different clusters. A good clustering yields only a few such cells. The measure heavily depends on the grid size but is very robust against split and concave clusters.
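The grid-based count can be sketched like this (toy data; the cell size is a free parameter, as noted above):

```python
# Sketch of the grid-based measure: count grid cells that contain points from
# more than one cluster (few mixed cells = good separation; toy data).
def mixed_cells(labelled_points, cell_size):
    cells = {}
    for (x, y), label in labelled_points:
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, set()).add(label)
    return sum(1 for labels in cells.values() if len(labels) > 1)

data = [((0.2, 0.2), "A"), ((0.4, 0.3), "A"),
        ((0.8, 0.8), "B"),                      # B point shares a cell with A
        ((3.1, 3.1), "B"), ((3.4, 3.2), "B")]
n_mixed = mixed_cells(data, cell_size=1.0)
```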

53. Given the following distance matrix for objects a_i ∈ A and b_i ∈ B, with A and B being clusters, compute the silhouette coefficient.

        a1  a2  a3  b1  b2  b3
   a1    0   1   2   3   6   5
   a2    1   0   3   4   5   8
   a3    2   3   0   6   7   6
   b1    3   4   6   0   2   1
   b2    6   5   7   2   0   1
   b3    5   8   6   1   1   0

d(a1, A) = (1 + 2)/2 = 3/2,  d(a2, A) = (1 + 3)/2 = 2,  d(a3, A) = (2 + 3)/2 = 5/2
d(a1, B) = (3 + 6 + 5)/3 = 14/3,  d(a2, B) = (4 + 5 + 8)/3 = 17/3,  d(a3, B) = (6 + 7 + 6)/3 = 19/3

d(b1, A) = (3 + 4 + 6)/3 = 13/3,  d(b2, A) = (6 + 5 + 7)/3 = 18/3,  d(b3, A) = (5 + 8 + 6)/3 = 19/3
d(b1, B) = (2 + 1)/2 = 3/2,  d(b2, B) = (2 + 1)/2 = 3/2,  d(b3, B) = (1 + 1)/2 = 1

S(a1) = (14/3 − 3/2) / (14/3) = 19/28
S(a2) = (17/3 − 2) / (17/3) = 11/17
S(a3) = (19/3 − 5/2) / (19/3) = 23/38
S(b1) = (13/3 − 3/2) / (13/3) = 17/26
S(b2) = (18/3 − 3/2) / (18/3) = 3/4
S(b3) = (19/3 − 1) / (19/3) = 16/19

sc = 1/6 · (19/28 + 11/17 + 23/38 + 17/26 + 3/4 + 16/19) ≈ 0.6961

The clustering seems to be good; the value is fairly high.

54. Given the following average distances for objects a, b, c coming from clusters
A, B, C respectively, compute the silhouette coefficient. Based on the
silhouette coefficient, is the clustering good?

        A    B    C
   a1   15   13   18
   a2   10   16   23
   a3    7   12   19
   b1   15   12   18
   b2   17   12   20
   c1   20   21    3
   c2   21   21    5
   c3   18   17   10

S(a1) = (13 − 15) / 15 = −2/15
S(a2) = (16 − 10) / 16 = 6/16
S(a3) = (12 − 7) / 12 = 5/12
S(b1) = (15 − 12) / 15 = 3/15
S(b2) = (17 − 12) / 17 = 5/17
S(c1) = (20 − 3) / 20 = 17/20
S(c2) = (21 − 5) / 21 = 16/21
S(c3) = (17 − 10) / 17 = 7/17

sc = 1/8 · (−2/15 + 6/16 + 5/12 + 3/15 + 5/17 + 17/20 + 16/21 + 7/17) ≈ 0.397

The clustering is not very good. The value is not close to 1.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping
clusters. Negative values generally indicate that a sample has been assigned to the
wrong cluster, as a different cluster is more similar.

55. Given the following images, would a centroid based measure give a good or
bad purity measure for this clustering?

1 2 3

1 - bad, 2 - good, 3 - bad

56. "We only need one purity measure to judge the quality of our clustering."
Do you agree with this statement? Justify your answer.

This statement is not true. A measure often prefers certain cluster structures or depends on chosen parameters. A centroid-based measure, for instance, performs poorly for split clusters although they may be well separated; a grid-based measure would say they are fine, given the grid size was chosen accordingly. So one measure does not fit every case, which is why we want to look at multiple measures. If all of them say the result is poorly clustered, we have much higher certainty that this is true than if only one says so.

57. Given the following clusterings, determine whether the silhouette coefficient, the centroid-based measure and the grid-based measure (with the underlying grid size) would give a good or bad result for the cluster purity. Would you regard the clustering as good (from a qualitative point of view)? Explain your answer.

a.
silhouette coefficient: I'd say it is pretty good. The clusters are obtained by k-means and are nicely separated. The intra-cluster distance should be good.
centroid-based: Same as with the silhouette coefficient. The clusters are obtained by k-means, so after convergence each point is assigned to its nearest centroid.
grid-based: Should give a good result. We have only a few cells that contain points of different clusters.
qualitative: The clustering looks good and seems to capture the underlying pattern of multiple circles well.

b.
silhouette coefficient: I'd say the silhouette coefficient does not give a good result, mainly because the red points (the ring) are many and nearly always have another cluster closer to them.
centroid-based: The centroid-based measure would give a bad result because of the red points: nearly all of them are closer to another centroid.
grid-based: The grid-based measure would give a perfect result, as we have no mixed cells.
qualitative: I'd regard this clustering as good because it reveals the pattern of a smiley face very well and also accounts for outliers.

c.
silhouette coefficient: I'd say the silhouette coefficient is not good for this example. The clusters are close to each other and often quite small, so the points at the cluster edges might not be closest to their own cluster.
centroid-based: I'd say the centroid-based method gives an ok result. The clusters are roughly convex and the points classified as outliers give some spacing.
grid-based: The grid-based measure gives a bad result for the clustering. We have a lot of mixed cells. A smaller grid would improve the measure.
qualitative: I would regard this as a bad clustering result. The clusters do not reveal the underlying circle structure, and the outliers seem random and plentiful.

d. Explain why we need visualization to fully assess the quality of clustering. (Hint: Imagine we used a much larger grid size in (b).)

A larger grid size in b would lead to mixed cells and therefore a bad result for
the measure. This would mean all of the measures we used would say we
have a bad clustering. Still, if we look at the clustering visualization in the
scatterplot, we can see that the clustering is actually good and probably what
we would expect. So even if the quantitative measures fail, it does not mean
the clustering is necessarily bad. The visualization helps us to see that.

Cluster Visualization
58. What are the tasks we want to perform when visualizing clusters?

We want to interpret, evaluate and refine clusters. Evaluating includes determining the cluster quality; refining includes merging and splitting clusters or adding/removing elements.

For a 2D clustering, we also want to show a distribution and statistical properties per
cluster, indicate what point belongs to which cluster and allow user selection. A fuzzy
clustering should also include the probability of a point belonging to a cluster. For
high dimensional data we want to preserve the distances in a projection so clusters
are perceived as such.

59. Name and explain 3 techniques for cluster visualization.

distance matrix
A distance matrix is an early technique. We calculate the distance of each object to
each other object and reorder the matrix based on that values. The cluster should
become visible.


glyphs
We can use glyphs to represent certain aspects. For instance we can color or
different shapes to represent membership to a cluster. A glyph should be consistent
and perceived as similar if the underlying data is similar. Glyphs can be combined
with other visualizations such as scatterplots


scatterplot
A scatter plot representation often requires some kind of projection to 2D or 3D. The
visualization depends on parameters of clustering and the projection technique used.
The distortion introduced through projection can make patterns appear that are not
inherent to the data. An animated scatterplot may serve well for a temporal
component.


parallel coordinates
Dimensions are displayed as vertical lines; data entries are curves that intersect the dimension lines at the position that represents their value. Clusters can eg be color coded. The visual presentation can be enhanced by varying opacity and by edge bundling.

[11] https://upload.wikimedia.org/wikipedia/commons/7/7a/Distance_matrix.PNG
[12] https://res.cloudinary.com/practicaldev/image/fetch/s--hmMb2h34--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/720/1%2ATEYPlUQfggUVnqu26QiQ3g.png
[13] https://www.researchgate.net/profile/Xingwang_Zhao/publication/220604252/figure/fig2/AS:746855587651585@1555075656855/Scatter-plot-of-the-ten-cluster-data-set.png

isolines
Isolines can be used to show the degree of membership in fuzzy clustering.

enhanced dendrograms
Dendrograms can be used for hierarchical clustering. They summarize the grouping. A big downside is that they do not scale well and do not enable verification of whether a cluster makes sense. An enhanced version can also take histograms of the clusters into account.

3D hierarchy visualization
These are also for hierarchical clustering. The clusters can be displayed as semi
transparent nested surfaces.

Outlier Detection
60. What characterizes an "Outlier" in the data? Why do we want to detect them?

An outlier is a data ​point that deviates strongly from all of its neighbours.​ They
are the extrema in a data set. What this means is highly context dependent. In high
dimensions, they are hard to identify because data is sparse anyway.

We want to detect outliers because they can either be interesting or unwanted. An unwanted outlier can, for instance, be a measuring error. Some algorithms cannot deal well with outliers in the data, so we should remove them in a preprocessing step.

61. Discuss the advantages and disadvantages of outlier removal.

On the one hand, outlier removal can be beneficial for the analyst. Some algorithms cannot deal with outliers and produce wrong or counterintuitive results. Outliers in measured data are also often errors and do not convey useful information, so the quality of the data improves if we remove them.

[14] https://upload.wikimedia.org/wikipedia/en/4/4a/ParCorFisherIris.png

On the other hand, outliers ​can be interesting points that do not stem from noisy
measurements. In that case, they can give new information. Imagine a person being
immune to a disease. They are definitely outliers in the data but can give information
on what gives an immunity (eg gene defects). So i​ f we remove them, we lose
valuable information.​

This means we have to consider removing them very carefully and based on
assumptions if they belong in the data or not and based on the insight we want to
get. For example, the immune person might be interesting for finding a cure but not
for evaluating how a disease spreads.

62. What methods can we use to detect an outlier?

depth based
Depth-based methods define outliers as the boundary of a distribution. They use the convex hull to identify outliers: points on the convex hull (depth 1) are the most likely outliers, and we therefore remove them.

distance based
Distance-based methods define outliers as points having an abnormal distance to their neighbours: a point is an outlier if fewer than α% of all points lie in its ε-neighbourhood.
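This distance-based rule can be sketched directly (toy 1D values; α is taken as a fraction rather than a percentage):

```python
# Sketch of the distance-based definition above: a point is flagged as an
# outlier if fewer than alpha (a fraction) of all points lie inside its
# eps-neighbourhood (toy 1D data).
def distance_based_outliers(points, eps, alpha):
    n = len(points)
    out = []
    for i, p in enumerate(points):
        inside = sum(1 for q in points if abs(p - q) <= eps)
        if inside / n < alpha:
            out.append(i)
    return out

values = [1.0, 1.1, 1.2, 0.9, 1.05, 9.0]      # 9.0 is far from everything else
outliers = distance_based_outliers(values, eps=0.5, alpha=0.5)
```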

density based
Density-based methods define outliers as points having an abnormally low density in their area. Simple techniques do not detect outliers reliably if clusters with different densities are not clearly separated.

cluster based
Cluster-based methods define outliers as points that are not part of any cluster. Generally, clustering algorithms are good at finding clusters, not outliers, so this should be used with care. A lot of similar outliers could also be considered a cluster.

63. In the following image, what points are outliers according to depth based
outlier detection?

the outer 3

64. What can the output of outlier detection algorithms be?

continuous: a continuous value for the degree of "outlierness"
discrete: an integer (eg the depth in depth-based methods)
binary: is an outlier or is not an outlier

65. What is the main criterion for validating outlier detection algorithms?

precision: If we have ground truth where we know how many outliers there are, we can test whether an algorithm returns the same number and the same objects. If a dataset (with M objects) has N outliers and an algorithm returns the top-N outliers, compute the size of the intersection divided by N. If N << M, precision is likely low, whereas for larger N it is better (with the same algorithm).

The intersection are the outliers that our algorithm found and that are also present in
the ground truth. If they are 100% the same, our algorithm is very precise. The
probability of our algorithm finding the same outliers is higher, if more outliers are
present. An algorithm will have it much harder if only 3 of 10,000 points are ground
truth outliers than if 300 of 10,000 points are outliers. So to judge the precision, we
will have to consider the ratio of N to M. For comparability between algorithms, we
subsequently have to choose the same conditions (preferably the same data set and
parameters).
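The top-N precision can be sketched directly (the outlier indices below are hypothetical):

```python
# Sketch of the top-N precision described above: size of the intersection
# between the algorithm's top-N outliers and the N ground-truth outliers,
# divided by N (hypothetical indices).
def top_n_precision(ground_truth, top_n_result):
    n = len(ground_truth)
    return len(set(ground_truth) & set(top_n_result)) / n

p = top_n_precision(ground_truth=[3, 17, 42], top_n_result=[3, 42, 99])
```

Here two of the three reported outliers match the ground truth, so the precision is 2/3.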

Biclusters
66. What are Biclusters? What do they represent? What properties do they have?

Biclustering is also called co-clustering or block-clustering. It is a data mining technique that searches for subsets in bipartite graphs.
A bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V. Edges exist only between elements of U and V, not between U and U or V and V. There is also no element without an edge.

This graph represents relations between two types of nodes, eg people associated with organizations. Biclustering is often done on categorical data. A biclustering algorithm can produce a massive amount of biclusters (similar to subspace clustering). Those biclusters are often redundant or overlapping.
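The bipartite-relation view can be sketched in code, assuming the simple constant/binary case (as eg Bimax uses); the relation matrix and names below are my own:

```python
# Sketch: in a binary relation matrix (rows = one node set U, columns = the
# other node set V), a bicluster is a row subset and a column subset whose
# submatrix contains only 1s (constant/Bimax-style case; an assumption here).
def is_bicluster(matrix, rows, cols):
    return all(matrix[r][c] == 1 for r in rows for c in cols)

# toy people x organisations relation
M = [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 1]]
ok = is_bicluster(M, rows=[0, 1], cols=[0, 1])   # rows {0,1} x cols {0,1} is all-ones
```

Note that column 1 belongs to both the block {0,1}x{0,1} and the block {2}x{1,2}, which is exactly the overlap discussed in question 68.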

67. To what type of data can biclustering be applied and which results occur?
Give an example for data to which biclustering can be applied?

Biclustering is mostly applied to categorical data, but it can also be applied to numerical data. Numerical values can be compared by their absolute difference (additive-related) or by their relation to each other (multiplicative-related). In numerical data, noise should be considered (eg by heuristics or stochastics). For numerical data, a fuzzy clustering is often better than a binary one.

Gene expression data is a common application for biclustering.

68. Do biclusters overlap or not?

Yes, biclusters can overlap because not the whole subspace is clustered but only a
subset of the subspace. These subsets can overlap for different biclusters.

69. What pre- and post processing might we perform when biclustering?

Preprocessing includes traditional steps like outlier removal, noise reduction, discretization, data transformation and normalization.

Postprocessing includes removing biclusters that are too small, filtering for clusters of high quality and ranking them according to size.

70. Algorithms

Biclustering algorithms are often parametrized w.r.t. a minimum number of rows/columns. Most return a binary clustering of categorical data, but numerical data can also be biclustered with fuzzy algorithms (eg FABIA). An exhaustive search would be an NP-hard problem, so we often use heuristic algorithms that search for biclusters bottom-up. Examples of bicluster algorithms are Bimax, LMAX and Charm.

71. Describe an example for chained biclusters.

Chained biclusters are biclusters that are ​related to each other by their rows or
columns​. An example would be people related to organizations and organizations
related to locations.


72. What are common applications of biclustering?

Common applications for biclusters are gene expression data, social network analysis and association rule mining.

Visualization
73. What tasks do we want to perform with a bicluster visualization?

A common visualization task is gaining an overview over properties like the size, the number of rows and columns, and which rows and columns are clustered. We also want to show individual biclusters, their members and metadata, and enable overlap views and comparisons between multiple biclusters. For fuzzy biclusters, we also need to show the membership values and a possibility to transform them into a hard clustering.

74. What are the requirements for a bicluster visualization? What are parameters for the initial generation?

A bicluster visualization needs to be scalable w.r.t. the number of rows and columns, the number of biclusters, and the number of rows and columns in overlaps.

Parameters for the initial generation or filtering of a biclustering are the (percentage) minimum/maximum number of rows and columns, the noise threshold for numerical data and the maximum overlap.

75. Name and explain 2 techniques for visualizing biclusters.

table based:
The table based approach shows rows and columns in a table. For up to 2 biclusters,
we can simply reorder the rows and columns to have a continuous representation of
the cluster. For >2 overlapping biclusters, we need duplicated rows/cols.

[15] https://www.researchgate.net/figure/Chaining-four-biclusters-through-multiple-relations-by-approximately-matching-sets-of_fig1_301697851

Duplications should be marked, and they do not scale well; several algorithms compute layouts that minimize duplications.
We can give each bicluster an individual color, but this does not scale well either.
Interaction in table-based approaches includes selecting biclusters for focus or highlighting, sorting, labeling, zooming and enabling/disabling duplicates.

parallel coordinates:
In parallel coordinates, we can color lines by bicluster, but this approach does not
scale well and we have a problem with overlapping biclusters. We can also link the
visualization to a table based approach.

graph display/node-link:
A node-link representation can be used for displaying chained biclusters. They
represent a m:n: … :z relation, eg. the relation between patient, disease and
treatment. Each node is a bicluster and each edge is the link between biclusters.
Links width can be scaled by frequency

set based:
Set-based approaches show the biclusters as subsets of bipartite graphs (remember the definition). A naive representation has high visual clutter, which can be reduced by inserting an abstraction in between. The abstraction can also be used to show additional info, like how many items are involved. We can also use the length of the bundle to show the size of the bicluster, or small rectangles in the item list view to represent frequency in the dataset (eg in text documents, where words/names may occur multiple times).
Interaction in set-based approaches can be sorting and ordering and the interactive placement of elements.

BicOverlapper
Overlaps between biclusters can also be displayed by transparent regions, which scales better than a node-link diagram. We can search for nodes, highlight connections, fix node positions and navigate through the graph.

Scatterplot-based Visual Representations


76. What do scatterplots represent? For what type of data is a scatterplot representation useful? What is the purpose of a scatterplot visualization?

A scatterplot represents the distribution of 2 continuous variables based on samples of those variables. We can identify sparse and high-density areas. Scatterplots are used to analyze clusters and correlations, identify outliers and compare datasets on the same axes.

77. What is a multi-class scatterplot?

A multi-class scatterplot is a scatterplot that also shows the class of the data points (eg which cluster they belong to). For instance, we show the relation between height and weight in a scatterplot and color the points according to whether a person is male or female.

78. Name 4 applications for scatterplots.

analyze correlations
detect outliers
analyze clusters
comparison of data on the same axes

79. Discuss overplotting and visual clutter and how it can be reduced for
scatterplots.

metrics of visual clutter: screen-space statistics (number of used pixels, number of free pixels, collisions), item number, redundancy, grouping, contrast, saliency (standing out)

possible solutions: reduction of glyph size, change of shape, jitter, transparency, color, rounding → all of these should be automatic

distortion: We can use distortion to reduce clutter by increasing screen space for dense regions and decreasing it for sparse regions. Distorted views are more difficult to interpret than undistorted visualizations → we need interaction to interpret them, and the distortion should be adjustable.

80. Is linear scaling always the best to use? Discuss. What else do we have to
look out for when scaling the scatter plot? Sketch examples for badly scaled
scatter plots.

Linear scaling of the axes is the easiest to interpret. But linear scaling can lead to large sparse regions, so to use the screen space efficiently, we can also scale the axes differently. Log scaling or square-root scaling are also used often and are well enough understood.

We also have to consider outliers. They can cause the scatterplot to scale awkwardly
if they are far away from every other data point. Generally, a scatterplot should be
scaled in a way that the data has some distance from the frame but isn't crammed in
a proportionally small part of the domain shown.

81. Why is a 3D scatterplot not useful (in a lot of cases)?

3D scatterplots show the distribution of 3 variables. They try to exploit the human visual system, which understands spatial relations well, but points are very hard to perceive in their spatial relation as we are missing depth cues. 3D scatterplots also suffer from occlusion problems and a mentally demanding interaction. That's why it is mostly preferred to show the combination of each pair of axes separately.

82. Explain the idea of SPLOM and GPLOM. What does a GPLOM do better?
Where do you see scalability limits?
(related question: Imagine that your data is mixed with some dimensions
being numerical and others categorical. How can the scatterplot matrix be
extended to display such data.)

SPLOM (Scatterplot Matrix)
A scatterplot matrix shows every pairwise combination of dimensions in a grid-like layout. The diagonal is either empty or shows data for the dimension (like its name or a histogram). A SPLOM is a way to visualize higher-dimensional data, but because it scales quadratically, it is only feasible for lower dimensionality. The upper and lower triangles are mirrored, so it would be enough to show only one half.

[image: SPLOM example, see footnote 16]

GPLOM (Generalized pair plot matrix)


A generalized pair plot matrix is similar to a scatterplot matrix but also deals with categorical data. Pairings of categorical and numerical dimensions are represented as bar charts, pairs of categorical dimensions as heat maps and pairs of continuous variables as scatterplots. A GPLOM can therefore show mixed data.

[image: GPLOM example, see footnote 17]

16
https://miro.medium.com/max/1400/1*C-BCaajZWvSAujSWeBAZBQ.png
17

https://www.researchgate.net/profile/Jean_Francois_Im/publication/256837289/figure/fig4/AS:409148
761100291@1474560075349/Example-plots-extracted-from-a-matrix-generated-with-the-gpairs-pack
age-in-R-7-Top.png

83. Explain 2 advanced scatter plot techniques.

Concentration Ellipses
Concentration ellipses are an abstract depiction of a multiclass distribution. Multiple
semi transparent ellipses with solid borders are fitted to the data points of a class.
Outliers may be excluded. To properly explore the data, interaction is needed as the
ellipses are only an abstract depiction.

[image: concentration ellipses, see footnote 18]

Binning
With binning, we abstract the data at hand. Multiple points are aggregated into groups.
There are several variants of a binned scatter plot. We can e.g. bin by the groups used in histograms, by a rectangular grid over the domain (discrete density plot), by a hexagonal grid over the domain (hexbin) or by adaptive ranges. The grid-based approaches are also called density plots.
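The grid-based variants boil down to counting points per cell. A small sketch in Python (my own illustration with synthetic data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

# Rectangular binning: count the points per grid cell.
# Plotting `counts` as a heat map gives a discrete density plot.
counts, xedges, yedges = np.histogram2d(x, y, bins=20)
```

Hexagonal binning (e.g. matplotlib's hexbin) works the same way; only the cell geometry differs.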

[images: binned scatter plot variants, see footnotes 19-21]

Splatter Plot 22
Splatter plots are similar to concentration ellipses but they can have arbitrary shape.
They are based on a density calculation over the whole domain. The edge of the
shape is an isoline matching a density threshold. The colors for data points are
blended based on density. The resulting polygons are smoothed. To avoid distracting
outliers, sparse regions are filtered and sampled.

18

https://lh3.googleusercontent.com/proxy/CDCpdo7ae1el2fqujpM9zpPHmM_6qvNnSq4wdr9-42Sl7ZO
TerKMq2bBpnqjO-LCRUKNwZ3sW5moTnJAYWCABNdCnSxV55Re1xAUkA-62OPPovDWd4-y5qF6
VUpgeCkJgL1SKLJniHt8MrM8ZiBttsmcLqm4dxcN_utOhUM_LSUd3hU8qRwpOmXW75uylA
19
https://doc.dataiku.com/dss/latest/_images/grouped-scatter.png
20
https://www.mathworks.com/help/examples/matlab/win64/BinScatterPropertiesExample_01.png
21

https://datavizproject.com/wp-content/uploads/2015/11/Sk%C3%A6rmbillede-2016-01-28-kl.-10.56.25
.png
22
https://graphics.cs.wisc.edu/Papers/2013/MG13/splatterplots-final.pdf

Generalized scatterplot
A generalized scatter plot deals with distortions. The user can move a slider to go from an undistorted view that may contain overlapping and overdrawing to a distorted but overlap-free view. Here, the interaction is essential to understand the data.
The algorithm to obtain the distorted view follows an iterative approach. Each data
point is added one after another. If the space, where the point is supposed to go has
another point already, the new point is displaced a little bit. We do this until all points
are added.

84. What is "Scagnostics"? What attributes can we derive from a scatterplot


(name 3)?

Scagnostics stands for scatter plot + diagnostics. It is based on a geometrical analysis of the point distribution. Scagnostics is not meant for the end user and requires collaboration between data scientists and domain experts.

Attributes that are analyzed are: outlierness, skewness, clumpiness, sparseness, striatedness, convexness, skinniness, stringiness, monotonicity.

85. Visual analytics comprises interaction – which interaction techniques are


useful for scatterplot-based visualization?

In a scatterplot-based visualization, we can make use of standard interactions like zooming, panning and autoscaling. Lasso or rectangle selection gives us the possibility to select clusters or data points for more details in a linked view. Hovering over points can give the values for a single point or help comparing a point to nearby ones.
We can use lenses to deal with the problem of overplotting by local zooming or local
distortion (focus-context visualization).
For a distorted scatter plot, the user also needs to be able to adjust the level of
distortion eg. by a slider or mouse interaction.
For an animated scatterplot we need the possibility to adjust playback speed,
start/stop the animation and rewind.

86. How can we evaluate a scatterplot visualization?

criteria: visual separability of clusters, correct assessment of correlations, trust/certainty of interpretation

ideal properties: avoids overlap, indicates amount of overlap, is scalable, is adjustable

87. Describe measures to compare two or more scatterplot-based
representations.

See previous question + maybe scalability

Dimension Reduction (DR)


88. General Questions
a. What is dimension reduction?

Dimension reduction comprises techniques to reduce or transform high-dimensional data to lower dimensions.

b. On what assumption is dimension reduction based on?

Dimension reduction is based on the assumption that most of the variance is captured by the intrinsic dimensionality of the data. This means data often consists of more dimensions than would be necessary to describe the information in the data. Dimension reduction aims to find this intrinsic dimensionality.

c. Why do we need dimension reduction?

High dimensional data is hard to analyze and visualize.​ Eg. a 10D data
set would result in 45 scatter plots in a SPLOM. In high dimensions, it is also
nearly impossible to find global clusters.

d. What are the two major ways to reduce dimensions?

feature selection:​ remove dimensions and only keep a subset of the original
dimensions

feature extraction​: transform data to a lower dimensional space.​

e. What are drawbacks of dimension reduction?

It always involves a loss of information.
Dimensions need to be centralized (zero mean) and normalized (divide by the range or σ) (auto scaling).
Drawback of auto scaling: noisy measurements are scaled up whereas large peaks in meaningful data get reduced.
Strongly correlating dimensions hamper the result (they should be removed upfront, remember feature selection).
Interpretability of the new dimensions is challenging; often domain scientists are not satisfied.

89. What properties should the result of a DR fulfill?

DR should preserve important structures such as clusters, outliers and correlations. Similarity in HD should be represented by proximity (similar → spatially close).

90. Explain Correlation-based feature selection.

Correlation-based feature selection is a feature selection technique that prunes dimensions which do not add much information.
It starts with an empty set and adds the feature (dimension) with the highest information gain. We then add more features that do not strongly correlate with any feature already in the set. After each addition, we compute a quality value (different approaches exist). If the quality value decreases after we add a feature, the algorithm terminates.
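A sketch of this greedy loop (simplified by me: relevance is plain correlation with a target instead of information gain, and the quality-value stopping rule is replaced by a fixed correlation threshold; all names are my own):

```python
import numpy as np

def correlation_feature_selection(X, y, max_corr=0.9):
    """Greedy sketch: rank features by |correlation with the target| and
    add each unless it correlates too strongly with an already chosen one."""
    d = X.shape[1]
    relevance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)]
    selected = []
    for j in np.argsort(relevance)[::-1]:          # most relevant first
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < max_corr
               for k in selected):
            selected.append(int(j))
    return selected

rng = np.random.default_rng(1)
a = rng.normal(size=200)
# features 0 and 1 are almost identical; feature 2 is independent noise
X = np.column_stack([a, a + 1e-3 * rng.normal(size=200), rng.normal(size=200)])
y = a + 0.1 * rng.normal(size=200)
picked = correlation_feature_selection(X, y)
```

On this toy data, one of the two nearly identical features is pruned.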

91. Explain the concept of progressive dimension reduction (PDR).

"Computational complexity of DR techniques do not allow direct employment in


interactive systems. This limitation makes the analytic process a time consuming
task that can take hours, or even days, to adjust the parameters and generate
the right embedding to be analyzed. The idea of Progressive Visual Analytics is to
provide the user with meaningful intermediate results, in case computation of the final
result is too costly. Based on these intermediate results the user can start with the
analysis process." 23

Progressive DR is part of progressive visual analytics and is based on the idea that
the ​user may start the analysis on intermediate results in case the full
computation takes too long​. So, PDR reduces only a subset of the points at first.
The user can then already adjust parameters in case the projection is poor without
waiting for the full result. This speeds up the process of analyzing.

92. What preprocessing steps might we need to perform before doing a


dimension reduction?

Before we apply an advanced dimension reduction technique, we can preprocess the data. In dimension space, we can already remove dimensions that do not add new information (e.g. strongly correlated dimensions can be replaced by a representative, and low-variance dimensions can be removed entirely). In item space, we can remove outliers because they influence variance and correlations strongly.

23
https://nicola17.github.io/publications/2016_AtSNE.pdf

Linear Techniques
93. What is the idea of linear dimension reduction (LDR) techniques? What data
is suitable for those methods?

LDR techniques generate a new set of dimensions where each new dimension is a linear combination of the original ones. LDR is suitable for normally distributed data. It does not scale well for very HD data and has a limited degree of freedom.

94. Principal Component Analysis (PCA)


a. Explain the general idea of PCA.

PCA generates a new coordinate system with orthogonal dimensions. The new dimensions are linear combinations of the old ones and are called principal components (PCs). Each PC carries a loading (the variance) that characterizes how much variability of the data is explained by it. Starting with the highest loading, we choose m PCs until a threshold is reached or we reach a maximum number of dimensions.
The projection error from the original n dimensions to the m PCs is minimized in the least-squares sense by this technique.

See also this website 24.

b. What are the steps of the algorithm of PCA?

- Normalize and centralize the data
- Determine the covariance matrix C
- Apply eigenvalue analysis to C (the eigenvectors are the PC axes, the eigenvalues give the variance for each PC)
- Reduce dimensions from n to m
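The steps map directly to NumPy. A sketch (my own illustration; the 95% variance threshold is an assumed parameter, not from the lecture):

```python
import numpy as np

def pca(X, var_threshold=0.95):
    # 1. centralize (and here also normalize) the data
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix
    C = np.cov(Xc, rowvar=False)
    # 3. eigenvalue analysis (eigh, since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]           # sort PCs by variance (loading)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. keep the first m PCs until the variance threshold is reached
    explained = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(explained, var_threshold)) + 1
    return Xc @ eigvecs[:, :m], eigvals

rng = np.random.default_rng(0)
t = rng.normal(size=300)
# two strongly correlated dimensions plus one independent dimension
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=300),
                     rng.normal(size=300)])
scores, variances = pca(X)
```

On this toy data the first two PCs already pass the threshold, so m = 2.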

c. How can we visualize a PCA?

score plot:​ scatterplot of the data for the largest 2-3 PCs

24
https://setosa.io/ev/principal-component-analysis/

[image: score plot, see footnote 25]

loading plot: shows how much each original dimension contributes to a PC (either as a bar chart or by arrows)

[image: loading plot, see footnote 26]

scree plot: shows variance (eigenvalues) for each PC. The scree plot is
missing a lot of information eg the influence of the original variables on the
PC

[image: scree plot, see footnote 27]

d. Is PCA suitable for cluster analysis?

Classic PCA is sensitive to outliers and assumes normally distributed data. It does not preserve clusters and is therefore not useful for cluster analysis. It is also challenging to interpret and needs preprocessing.

25
https://support.minitab.com/de-de/minitab/18/principal_components_loan_applicant_score_plot.png
26
https://support.minitab.com/de-de/minitab/18/principal_components_loan_applicant_loading_plot.png
27
https://upload.wikimedia.org/wikipedia/commons/a/ac/Screeplotr.png

e. Name 3 possible disadvantages of PCA.

difficult to interpret, does not preserve clusters, sensitive to outliers

95. Factor Analysis (FA)


a. Explain the general idea of FA. How does it differ from PCA?

FA is based on second-order statistics. It tries to find underlying factors for the dimensions we have. This means in PCA we have the underlying basis and want to construct a new basis; in FA we have the new basis and want to find the underlying basis. Mathematically speaking, if dim_i is the i-th original dimension:

PCA
PC1 = a * dim1 + b * dim2 + c * dim3 + …

FA
dim1 = x * factor1 + y * factor2 + z * factor3 + ...

b. Name an example where FA is used often.

Factor analysis is often used in psychology to find factors in a personality. Eg


answers to a test form the original dimensions and we explain these answers
by personality traits like intelligence, optimism, introversion….

96. Projection Pursuit


a. Explain the general idea of Projection Pursuit (PP) 28.

In PP, each candidate projection is associated with an index that describes how well the projection separates the data. The resulting projection reveals clusters very well. It can also be applied to individual clusters (e.g. results of global or subspace clustering).

The projection index is defined as a combined measure (product) of spread and local density. Spread is a modified (trimmed) version of the standard deviation. Local density is the average distance of point pairs within a cutoff radius. We want wide spread and high local density.
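An illustrative version of such an index for a 1D projection (the exact trimming and cutoff definitions here are my own assumptions, not the lecture's):

```python
import numpy as np

def projection_index(points_1d, trim=0.1, cutoff=0.5):
    """Illustrative projection index: spread (trimmed std. dev.) times
    local density (inverse mean pairwise distance within a cutoff radius)."""
    p = np.sort(points_1d)
    k = int(len(p) * trim)
    spread = np.std(p[k:len(p) - k])           # trimmed standard deviation
    d = np.abs(p[:, None] - p[None, :])        # pairwise 1D distances
    close = d[(d > 0) & (d < cutoff)]          # pairs within cutoff radius
    density = 1.0 / close.mean() if close.size else 0.0
    return spread * density

rng = np.random.default_rng(0)
# two well-separated tight clusters -> wide spread AND high local density
good = np.concatenate([rng.normal(-3, 0.1, 100), rng.normal(3, 0.1, 100)])
# one diffuse blob -> lower index
bad = rng.normal(0, 1.0, 200)
```

The cluster-revealing projection scores higher than the diffuse one.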

[sketches: a bad vs. a good projection index]

28
https://towardsdatascience.com/interesting-projections-where-pca-fails-fe64ddca73e6

b. Sketch (in the sense of roughly describing) how the algorithm works.

PP is an efficient iterative algorithm. We start with a set of initial projection directions and optimize them with respect to the projection index until convergence. This only yields a local optimum, so we might need to repeat this with several random initializations.

c. Is Projection Pursuit suitable for cluster analysis?

Yes, it preserves clusters very well and is a powerful tool for exploratory data
analysis. It also is robust against noise and efficient.

Non-linear Techniques
97. What is the idea behind non-linear projection techniques?

Non-linear techniques allow more degrees of freedom and are also suitable for skewed and multimodal distributions. They try to preserve small distances, because small distances are often more interesting than large distances.

98. Multi-dimensional scaling (MDS)


a. What is the general idea behind MDS?

MDS preserves distances optimally when transforming from high- to low-dimensional space. It is formulated as an optimization problem that iteratively minimizes stress. Stress is defined as the (summed squared) difference between the pairwise distances in HD and LD. We can choose different distance metrics.
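The raw (unnormalized) stress can be sketched like this (my own illustration; variants such as Kruskal's stress normalize this sum):

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance matrix between all pairs of rows
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def stress(X_high, X_low):
    """Raw stress: summed squared difference between HD and LD
    pairwise distances (the quantity MDS iteratively minimizes)."""
    return ((pairwise_dist(X_high) - pairwise_dist(X_low)) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# a perfect "embedding" (the data itself) has zero stress,
# while simply dropping dimensions distorts distances
```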

b. What are applications of MDS?

We can use MDS for text analysis, force directed layouts or to show results of
subspace clustering.

c. What is the advantage of the variant "progressive steerable MDS"?

MDS is generally slow (O(n³); the fastest variants are O(n²)). To speed it up, or rather to avoid unnecessary calculations, we can only transform a small subset of
the points first and let the user choose what region to transform then. This
way we avoid wasting time on regions the user is not interested in right now.
It also gives an immediate overview for further exploration.

99. Stochastic Neighbourhood Embedding (SNE)


a. Sketch how the algorithm works.

SNE preserves global and local structures when mapping from HD to 2D/3D. HD and LD Euclidean distances are converted to conditional probabilities (according to a Gaussian centered at each point). We then move the points in the projection until the distributions are similar enough.
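The distance-to-probability conversion can be sketched as follows (a simplification by me: one fixed σ for all points, whereas real SNE tunes σ per point via the perplexity parameter):

```python
import numpy as np

def conditional_probs(X, sigma=1.0):
    """p_{j|i}: probability that point i would pick j as its neighbour,
    from a Gaussian centred at point i (diagonal zeroed, rows normalized)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbour
    return P / P.sum(axis=1, keepdims=True)  # each row is a distribution

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
P = conditional_probs(X)
```

SNE computes such a P for the HD data and a Q for the LD layout, then moves the LD points to make Q match P.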

b. What is the advantage of the variant t-SNE?

SNE has the problem that it projects a lot of points into the center. This leads to crowding. t-SNE uses a Student-t distribution (in the low-dimensional space) instead of a Gaussian and evens out the density. It therefore solves the crowding problem.

c. Is t-SNE suitable for cluster analysis?

SNE and t-SNE preserve clusters. t-SNE expands dense clusters and contracts sparse clusters, so it is not suitable to compare relative cluster sizes and relative distances between clusters.

100. Name other non-linear projection techniques.

Sammon mapping, Isomap, locally linear embedding, self-organizing maps, crafted projections (user input guides the projection)

Projection Quality
101. Error
a. By what means can we analyze the projection error?

objectively/quantitatively:​ error metrics


subjectively/qualitatively​: visualization that indicates error distribution

b. What are the essential questions we have to ask ourselves regarding the error?

How is the error spread?
Where are the points located that have a high difference in distance between
projection and HD?
How does parameter choice affect the error?

c. What types of errors can we distinguish?

displacement error, correlation between distances, tears (points close in HD but not in LD), false neighbourhoods (points far apart in HD but close in LD), clusters

102. What can we use as a measure for an error metric?

A lot of ​DR techniques optimize some error function → can use those as error
metric​. Eg preservation of distances, overlap between k-nearest neighbours,
agreement in ranking between k-nearest neighbours, correlation of pairwise
distances
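The k-nearest-neighbour overlap, for instance, can be sketched like this (a simplified version; the names are my own):

```python
import numpy as np

def knn_overlap(X_high, X_low, k=5):
    """Fraction of each point's k nearest neighbours in HD that are also
    among its k nearest neighbours in the low-dimensional projection."""
    def knn(X):
        d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)          # exclude the point itself
        return np.argsort(d, axis=1)[:, :k]
    hi, lo = knn(X_high), knn(X_low)
    overlaps = [len(set(hi[i]) & set(lo[i])) / k for i in range(len(X_high))]
    return float(np.mean(overlaps))          # 1.0 = neighbourhoods preserved

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
```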

103. How can we visualize the error?

A visualization should provide cues about local reliability in the structure directly. We
can show it per point or per region (eg as color in the background of the scatter plot).
The type of error should be conveyed as well.

Examples are Voronoi Cells (their drawback is, they also show error in areas without
points), false neighbour and missing neighbour view 29 and stress maps 30.

104. What is the problem when the user wants to compare different techniques
interactively? How can we decrease it?

The user might want to explore and compare different DR techniques with multiple parameters. Some DR techniques are very slow, so interactivity is difficult performance-wise. We can still provide interactivity by using approximative solutions, multigrid and multisolver techniques, and by enabling GPU support.

Assisted Dimension Reduction


105. Describe forms of interaction between an analyst and the dimension
reduction algorithm.

29

https://www.semanticscholar.org/paper/Visual-analysis-of-dimensionality-reduction-quality-Martins-Coi
mbra/9e7105b77093b1947637040034b0d7b80ab35a20
30

https://www.semanticscholar.org/paper/Stress-Maps%3A-Analysing-Local-Phenomena-in-Reduction-
Seifert-Sabol/15ecef22524c2cb650fed2357e0d0b7feefe1625

I guess the next 2 questions answer this as well.

DR is a very automated process. The user may choose the algorithm and some of its parameters (e.g. which distance metric to use (MDS), or perplexity and noise (SNE)).

106. Explain Visual Hierarchical Dimension Reduction (VHDR).

In VHDR, dimensions are grouped into a hierarchy, and lower-dimensional spaces are constructed using clusters of the hierarchy.
First, all the original dimensions of a multidimensional data set are organized into a
hierarchical dimension cluster tree according to similarities among the dimensions.
Similar dimensions are placed together and form a cluster, and similar clusters in turn
compose higher-level clusters. The user interactively selects interesting dimension
clusters from the hierarchy in order to construct a lower dimensional subspace. A
representative dimension (RD) is assigned or created for each selected dimension
cluster. 31

107. How does dimension reduction through user defined quality metrics work?
What quality metrics could we use? Explain 2.

Dimension reduction through user-defined quality metrics is basically interactive feature selection. The user defines an error metric, and dimensions that do not contribute are "discarded". No projection is performed and no new dimensions are generated. The loss of information is displayed to the user, who can select dimensions himself or choose another error metric.

We can for instance use ​outlier preservation rate as an error metric. The algorithm
will try to preserve outliers when reducing dimensions. For each pair in several
dimensions (2D, 3D, …) it is recorded for which dimension selection a point is
considered an outlier.

Another one is ​cluster dominance​. The algorithm tries to remove only dimensions
that do not contribute to subspace clustering. Loss of information relates to the
dimensions removed that contribute to clusters.

108. What is "Seek a view"? What is it used for?

"Seek a view" helps the user to find interesting subspaces. For outliers, negative
correlations and other things, more than one subspace can be relevant.
It is a combination of visual representations (views) used to interpret subspaces.
Filtering enables reduction for iterative refinement.

31

https://www.researchgate.net/publication/220778426_Visual_Hierarchical_Dimension_Reduction_for_
Exploration_of_High_Dimensional_Datasets

109. Why is guidance so important for dimension reduction tasks?

Dimension reduction includes a lot of steps. We need to transform data, filter subspaces, annotate things (e.g. assign meaningful labels to new dimensions), search for highly correlated variables that can be merged or low-variance variables that can be removed, and we need to select a DR technique and adjust its parameters.
It is therefore very complex and overwhelming for the user to use such a system and get insight from the data. Guidance helps the user to find a suitable workflow without forcing a particular way on them. DimStiller 32 is an example of a guidance system.

Decision Trees (DTs)


110. What is a Decision Tree (DT)? What is it used for?

A decision tree is a decision support tool that uses a tree-like graph to model decisions and their possible outcomes. It is a supervised learning method that makes no assumptions on the data distribution. It is used to classify data.

111. What are the components of a Decision Tree? Make a sketch of a DT and label them.

A DT consists of a root node, internal decision nodes (each testing a splitting attribute), branches labeled with the attribute's values, and leaf nodes carrying the class labels.
112. Algorithm
a. Explain steps of the algorithm to build a decision tree. ​(Other
formulation: Explain the basic tree induction algorithm.)

- start with an empty tree and the entire data set
- if all samples of the current node have the same label c → create a leaf node of class c
- else: select the splitting attribute s that is "most useful" to support a decision → make a decision node and branches for s → split the dataset according to s
- recursively repeat until a stopping criterion is reached

32
https://eprints.cs.univie.ac.at/4209/1/vast10_dimstiller.pdf

b. What are possible stopping criteria?

We can stop if 100% (or above some threshold) of the samples are correctly classified. If we use a threshold, we assign to the leaf node the label that the majority of the data in that branch has. We can also stop at a maximum number of branches, a maximum tree depth, a maximum number of leaf nodes, a maximum number of decision nodes, ...

c. What measures can be used to decide if an attribute is a good splitting


attribute? Explain 2.

We often use either information gain or gini gain as a goodness/impurity measure to determine if an attribute is a good splitting attribute. Both measure the difference between the impurity of the current node and the (weighted) impurity of the child nodes after the split.
For both, we want to have the maximum gain, which means we want to maximize the respective function.

entropy 33:
The entropy of the set of items S in the current node is given by
Entropy(S) = − Σ_{i=1}^{k} p_i · log₂(p_i), where p_i is the probability of occurrence of class i.
The entropy "measures" chaos. Uniform probability (meaning the same chance for each class) yields maximum uncertainty and therefore maximum entropy. So the entropy decreases as a node approaches a homogeneous class.

information gain:
The information gain is then
IG(S, A) = Entropy(S) − Σ_{v∈dom(A)} (|S_v| / |S|) · Entropy(S_v), where S is the set of items at the current node, A the potential splitting attribute, v a splitting value of the attribute and S_v the set of items after splitting by value v of attribute A.
Information gain favours attributes with a lot of splitting values.
We want to minimize the entropy → want a big difference in entropy before and after → maximize information gain.

33
https://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition

gini index:
The gini index is defined as
Gini(S) = 1 − Σ_{j=1}^{k} p_j². It describes how often a randomly chosen element from S would be incorrectly labeled. The gini index is similar to entropy but does not have the expensive logarithmic operation. A gini index of 0 means a perfect split.

gini gain:
The gini gain is defined similarly to the information gain, but uses the gini index instead of entropy. We want to maximize this again:
GG(S, A) = Gini(S) − Σ_{v∈dom(A)} (|S_v| / |S|) · Gini(S_v)

113. Validation and Evaluation


a. How can we validate a DT?

One big flaw of DTs is the potential overfitting to training data. Small changes in the input data can yield a very different DT. That is why validation is very important.

Cross validation is the standard method for validation. The training data gets
partitioned into k distinct subsets. We use k-1 of those to train the tree and
the k-th to validate if the tree classifies unknown data correctly. We cycle
through all k subsets and use the tree with the highest accuracy.
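The index bookkeeping of k-fold cross validation can be sketched like this (function name and sample values are my own):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split n sample indices into k disjoint folds; each fold serves once
    as validation set while the remaining k-1 folds form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# e.g. the 14 samples of the weather exercise, split into 5 folds
splits = list(kfold_indices(14, 5))
```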

Another possibility is separate training and evaluation data. The disadvantage here is that we do not use all the data we have for training, and the potential for overfitting is higher, as we only have one tree.

b. Write down the formula for Effectiveness of a DT. What two criteria
does it combine? Explain them.

Effectiveness of a Decision Tree (EDT) is a combined measure for accuracy and size. Accuracy says how well a tree can predict the target
variable. Size (or complexity) says how good the decision process can be
interpreted. A very accurate but large tree is often overfitted to the
training data and therefore less accurate for other data. A large tree is also
hard to interpret. The decisions might seem arbitrarily chosen and do not give
good insight on why a classification was chosen that way.

EDT = (N_correct − k · N_leaf) / N, where k is the penalty per leaf node. The value of k depends on the application. For instance, if accuracy is the only concern, then k is equal to zero and EDT is equal to the accuracy of the model. 34
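As a one-line sketch (the sample values and k = 0.05 are made up for illustration):

```python
def edt(n_correct, n_leaf, n_total, k=0.05):
    """EDT = (N_correct - k * N_leaf) / N; with k = 0 it is plain accuracy."""
    return (n_correct - k * n_leaf) / n_total

# a tree classifying 90 of 100 samples correctly with 8 leaf nodes
score = edt(90, 8, 100)
```

A larger tree with the same accuracy would receive a lower EDT, reflecting its worse interpretability.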

c. How can we reduce the complexity of a DT?

We can reduce the complexity by preprocessing the data before creating the DT or by modifying the DT. A user-steered process with domain experts can also reduce the complexity.
Preprocessing can be feature selection, outlier removal and clustering (making one DT per cluster).
After the DT was constructed, we can reduce the complexity by pruning, by merging splitting values or by rounding off splitting values to enhance interpretability.

114. Exercise 11 - Block1


You and your friends want to decide more quickly whether the weather is
good enough to go out or not. For that reason, you set up a database
recording the weather situations from 14 previous occasions and the
information whether you went out on that day or not, leaving you with 5
features for each entry, depicted in Table 1. For each occasion, the database
contains information on Chance of Rainfall, Humidity, Wind, Temperature, and
whether you went out that day or not (Activity). Induce a decision tree by hand
on the dataset in Table 1 with Activity as target attribute and Gini gain as
goodness measure for attribute test conditions. Perform multiway splits* when
applicable.
* Multiway split describes the situation in which a node has more than two
sub-trees.
(​This exercise takes VERY long. An exam question will more likely be
smaller if something like it is in it. If it is not, do this last.)

I also made a mistake while calculating it, so the split criteria for the root
and therefore for the rest of the tree is different than in the solution
provided. The general algorithm is the same though.

34
​https://www.sciencedirect.com/science/article/pii/S1071581906001078​ (access via university login)

● Gini Index: GI(S) = 1 − (p_out² + p_stay²)
● Gini Gain: GG(S, A) = GI(S) − Σ_{v∈A} (|S_v| / |S|) · GI(S_v)
● Root node
|S| = 14, p_out = 6/14, p_stay = 8/14, GI(S) = 24/49

○ Rain
Rain <30%, |S| = 7, p_out = 5/7, p_stay = 2/7, GI(Rain <30%) = 20/49
Rain >30%, |S| = 7, p_out = 1/7, p_stay = 6/7, GI(Rain >30%) = 12/49
GG(S, Rain) = 24/49 - (1/2 * 20/49 + 1/2 * 12/49) = ​8/49
○ Humidity
Humidity low, |S| = 4, p_out = 2/4, p_stay = 2/4, GI(low) = 1/2
Humidity medium, |S| = 7, p_out = 4/7, p_stay = 3/7, GI(medium) =
24/49
Humidity high, |S| = 3, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, humidity) = 24/49 - (4/14 * 1/2 + 1/2 * 24/49 + 0) = ​5/49
○ Wind
Wind soft |S| = 8, p_out = 4/8, p_stay = 4/8, GI(soft) = 1/2
Wind strong |S| = 6, p_out = 2/6, p_stay = 4/6, GI(strong) = 4/9
GG(S, Wind) = 24/49 - (8/14 * 1/2 + 6/14 * 4/9) = ​2/147
○ Temperature
<10*C |S| = 5, p_out = ⅕, p_stay = 4/5 , GI(<10C) = 8/25
>10C and <35, |S| = 7, p_out = 4/7 , p_stay = 3/7 . GI() = 24/49
>35C |S| = 2, p_out = 1/2 , p_stay = 1/2. GI(>35C) = 1/2
GG(S, Temp) = 24/49 - (5/14 * 8/25 + 1/2 * 24/49 + 2/14 * 1/2 ) =
29/490

Rain gives us the highest Gini Gain, so we will split our dataset
according to rain probability first

● Rain <30%
|S| = 7, p_out = 5/7, p_stay = 2/7, GI(Rain <30%) = 20/49 (we already
calculated that)
○ Humidity
low, |S| = 4, p_out = ½, p_stay = ½, GI(low) = 1/2
medium, |S| = 3, p_out = 1, p_stay = 0, GI(medium) = 0
high, |S| = 0
GG(S, Humidity) = 20/49 - (4/7 * ½) = ​6/49 = 0.122
○ Wind
soft, |S| = 4, p_out = ¾, p_stay = ¼, GI(soft) = 3/8
strong, |S| = 3, p_out = ⅔, p_stay = ⅓, GI(strong) = 4/9
GG(S, Wind) = 20/49 - (4/7 * 3/8 + 3/7 * 4/9) = ​1/294
○ Temperature
<10*C |S| = 3, p_out = 1/3, p_stay = 2/3 , GI(<10C) = 4/9
>10C and <35, |S| = 3, p_out = 1 , p_stay = 0 . GI() = 0
>35C |S| = 1, p_out = 1 , p_stay = 0. GI(>35C) = 0
GG(S, Temp) = 20/49 - (3/7 * 4/9) = 32/147 ≈ 0.218

For this branch, the next splitting attribute is temperature.

● Temperature <10*C
|S| = 3, p_out = 1/3, p_stay = 2/3 , GI(<10C) = 4/9
○ Humidity
low, |S| = 3, p_out = ⅓, p_stay = ⅔, GI(low) = 4/9
medium, |S| = 0
high, |S| = 0
GG(S, Humidity) = ​0
○ Wind
soft, |S| = 2, p_out = ½, p_stay = ½, GI(soft) = 1/2
strong, |S| = 1, p_out = 0, p_stay = 1, GI() = 0
GG(S, Wind) = 4/9 - ⅔*½ =​ 1/9

We split according to wind


● Wind soft
|S| = 2, p_out = ½, p_stay = ½, GI(soft) = ½
○ Humidity
low, |S| = 2, p_out = 1/2, p_stay = 1/2, GI(low) = 1/2
medium, |S| = 0
high, |S| = 0
GG(S, Humidity) = ​0

This is the last possibility to split but it results in a mixed node, so we can also leave this out. We need to assign a label to the
leaf node but the result is still mixed. We have no majority, so we
can just decide which label we want to use. I will use "stay
home".
And because we also have "stay home" in the other branch, we
can also leave out the split for wind and make it a leaf node.

● Wind strong
|S| = 1, p_out = 0, p_stay = 1, GI() = 0
is leaf → stay home
● Temperature >10C and <35C
|S| = 3, p_out = 1 , p_stay = 0 . GI() = 0
is leaf → go out
● Temperature >35C
|S| = 1, p_out = 1 , p_stay = 0. GI(>35C) = 0
is leaf → go out
● Rain >30%, |S| = 7, p_out = 1/7, p_stay = 6/7, GI(Rain >30%) = 12/49
○ Humidity
low, |S| = 0
medium, |S| = 4, p_out = 1/4 , p_stay = 3/4 , GI(medium) = 3/8
high, |S| = 3, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, Humidity) = 12/49 - (4/7 * ⅜) = 3/98
○ Wind
soft, |S| = 4, p_out = ¼, p_stay = ¾ , GI(soft) = 3/8
strong, |S| = 3, p_out = 0, p_stay = 1, GI(strong) = 0
GG(S, Wind) = 12/49 - (4/7 * ⅜) = ​3/98
○ Temperature
<10*C |S| = 2, p_out = 0, p_stay = 1 , GI(<10C) = 0
>10C and <35, |S| = 4, p_out = 1/4 , p_stay = 3/4 . GI() = 3/8
>35C |S| = 1, p_out = 0 , p_stay = 1. GI(>35C) = 0
GG(S, Temp) = 12/49 - (4/7 * 3/8 ) = ​3/98

All of them have the same Gini Gain → choose one. (I will use
Temperature).
● Temperature <10*C
|S| = 2, p_out = 0, p_stay = 1 , GI(<10C) = 0
is leaf → stay home
● Temperature >10C and <35C
|S| = 4, p_out = 1/4 , p_stay = 3/4 . GI() = ⅜
○ Humidity
low, |S| = 0
medium, |S| = 2, p_out = ½, p_stay = ½, GI(medium) = 1/2
high, |S| = 2, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, Humidity) = ⅜ - (½ * ½) = ​1/8
○ Wind
soft, |S| = 1, p_out = 1, p_stay = 0, GI(soft) = 0
strong, |S| = 3, p_out = 0, p_stay = 1, GI(strong) = 0
GG(S, Wind) = 3/8

Wind results in leafs only, so we split according to that.

● Temperature >35C
|S| = 1, p_out = 0 , p_stay = 1. GI(>35C) = 0
is leaf → stay home

● The resulting tree is: [sketch of the induced decision tree]

115. Exercise 11 - Block 2


Compare Random Forests and traditional decision trees. What are the
advantages and disadvantages?

(I used information from 35 and 36 )


A random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model's prediction. Prerequisites for a random forest to perform well are:
- There needs to be some actual signal in our features so that models built using
those features do better than random guessing.
- The predictions (and therefore the errors) made by the individual trees need to
have low correlations with each other. Low correlations between the trees in the
forest will help to even out errors from individual trees.

[35] https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[36] https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991

Advantages of a decision tree are easy interpretation, relatively easy visualization,
good performance on large data sets, and speed. On the downside, they easily
overfit the training data, and finding an optimal tree is difficult because the resulting
tree can vary greatly depending on the training data.

A random forest can be much more accurate because the end result is an
aggregation of many weakly correlated decision trees. Random forests are harder to
visualize and are slower, because they are essentially a collection of decision trees.
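The two ingredients that make the ensemble work — bootstrap sampling, so each tree sees a slightly different training set, and majority voting over all trees — can be sketched in a few lines. The "trees" below are stand-in constant classifiers, since the point here is the ensemble mechanics, not tree induction:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # draw |data| points with replacement: each tree trains on a different
    # resample, which lowers the correlation between the trees' errors
    return [rng.choice(data) for _ in data]

def forest_predict(trees, x):
    # every tree votes; the class with the most votes wins
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# toy "trees": constant classifiers standing in for real decision trees
trees = [lambda x: "go out", lambda x: "go out", lambda x: "stay home"]
print(forest_predict(trees, None))  # go out

rng = random.Random(0)
print(len(bootstrap_sample([1, 2, 3, 4, 5], rng)))  # 5
```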

Visualization
116. Next to the DT itself, what is profitable to also show in the visualization?

The DT itself only shows the result of the DT process. To really gain an
understanding of how the decision was made, it is also beneficial to show the
training data and its statistics. If the real-world data does not follow a similar
distribution to the training data, this might explain poor decisions.

117. What are the requirements we have for a decision tree visualization?

The DT should first be displayed in its entirety for an overview, and subtrees should
be selectable for a detailed view. Leaf nodes and splitting nodes should be easily
distinguishable, and leaf nodes should have their class label visible. Ideally we can
also see the distribution of the splitting attribute at the splitting node, or in a linked,
synchronized view. At all times, the accuracy should be updated and presented.

118. What interactions might we want to perform with the visualization?

As always, we want to perform panning, zooming and requesting details. The user
might also interact with the tree by merging, splitting or deleting decision nodes.

119. Name the basic techniques to visualize a DT. Explain 2. Which one is the
best (according to studies)?

There are 4 basic techniques.

The outline view / indentation diagram basically looks like a folder-file structure.
It can be combined with expanding/collapsing of subtrees.

A node-link diagram is the easiest to understand but is not efficient in terms of
screen space. Enhanced node-link diagrams can also display distributions and the
number of involved items at each node.

The treemap is very space-efficient but makes it difficult to understand the
decisions that were made. A variant is the tree ring, which uses a radial layout.

The icicle plot is more space-efficient than a node-link diagram, and the hierarchy
is easier to perceive than in a treemap. Of all basic techniques, it has the best
trade-off between screen-space efficiency and interpretability.

[Figures: examples of an outline view, a node-link diagram, a treemap and an icicle plot]
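To make the icicle plot concrete, here is a minimal layout sketch: each node gets a rectangle whose width is proportional to its number of leaves and whose depth gives the vertical position, which is why the plot is space-filling. The dict-based tree format is an assumption made only for this example:

```python
def count_leaves(tree):
    # a node without children is a leaf
    children = tree.get("children", [])
    return 1 if not children else sum(count_leaves(c) for c in children)

def icicle_layout(tree, x0=0.0, x1=1.0, depth=0, rects=None):
    # each node -> (name, left x, right x, depth); widths split the parent's
    # horizontal extent in proportion to the leaf counts of the children
    if rects is None:
        rects = []
    rects.append((tree["name"], x0, x1, depth))
    total = count_leaves(tree)
    x = x0
    for child in tree.get("children", []):
        w = (x1 - x0) * count_leaves(child) / total
        icicle_layout(child, x, x + w, depth + 1, rects)
        x += w
    return rects

# toy DT shaped like the exercise's subtree: Temperature -> Wind
tree = {"name": "Temperature",
        "children": [{"name": "stay home"},
                     {"name": "Wind",
                      "children": [{"name": "go out"}, {"name": "stay home"}]}]}
for name, left, right, depth in icicle_layout(tree):
    print(name, round(left, 2), round(right, 2), depth)
```

Rendering each tuple as a rectangle (depth mapped to a row) yields exactly the stacked-bars look of the icicle plot.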

120. What are advanced techniques to visualize a DT? Explain one in detail.

Advanced techniques are, for instance, improved versions of basic techniques.

Node-link diagrams can be improved by Bezier curves for edges, histogram of the
target variable and collapsed subtrees. The width of the edges can be used to show
how frequent a path was chosen. Details can be shown on mouse over.

Icicle plots could show information about the training data, but this does not scale well.

An advanced technique involves the user ​choosing splitting values interactively.

The user defines a hyperplane for splits based on a scatterplot matrix. The initial line
can be automatically computed with support vector machines.

Baobab is a visualization that combines multiple views to provide an overview and
details, as well as different aspects of the DT and training data. The tree is displayed
as a node-link diagram where the layout can be ordered in different ways.

121. What is the advantage of semi-automatic construction of a DT?

A semi-automatic construction lets domain experts use their knowledge to reduce
the complexity of the tree and optimize the accuracy-complexity ratio.

122. How can we visualize error metrics and quality of a DT?

We can validate a DT by visualizing the errors it makes. Possible visualizations are
the decision quality plot, the confusion matrix and the projection of misclassifications
into the spatial domain.

decision quality plot
A scatter plot with the quality measures accuracy and tree size is used as an
overview. The Pareto front gives the best trade-off between those two; from it, we
can choose the best decision tree.

confusion matrix
The confusion matrix indicates how often a class label was confused with another. It
may reveal patterns in misclassification.

We can also project misclassifications into the spatial domain to find clusters of
misclassification.
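A confusion matrix as described above takes only a few lines to compute: rows are true classes, columns are predicted classes, and off-diagonal entries reveal which classes get confused. The labels and predictions below are made up for the example:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # counts[(t, p)]: how often true class t was predicted as class p
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = ["out", "out", "stay", "stay", "stay"]
y_pred = ["out", "stay", "stay", "stay", "out"]
print(confusion_matrix(y_true, y_pred, ["out", "stay"]))
# [[1, 1], [1, 2]]
```

The diagonal holds the correct classifications; everything else is a misclassification worth inspecting (e.g. by projecting those samples into the spatial domain).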

123. What evaluation criteria do we have for visualizations of DTs?

A visualization should make it easy to identify the tree topology, node relations
and leaf size. It should also allow adjusting the layout to user preferences.

Intertopic Questions
124. Explain the relation between dimension reduction and subspace clustering.

Dimension reduction and subspace clustering are closely related. With
dimension reduction, we want to reduce a dataset to its intrinsic
dimensionality. Intrinsic dimensionality means that most of the variance of the
data is captured by fewer dimensions than the data currently has. If a dimension
does not contribute to a subspace cluster, it is not part of this intrinsic
dimensionality, so we can very likely remove it from the dataset.

Also, if we do dimension reduction (e.g. feature selection) before attempting
subspace clustering, the search space for subspace clustering becomes smaller,
which speeds up the clustering. Clustering within a single subspace is equivalent to
reducing the dataset to that subspace and performing a global clustering.
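The notion of intrinsic dimensionality can be made concrete with PCA: the eigenvalues of the covariance matrix show how much of the variance each principal direction captures. NumPy is assumed here, and the toy data is made up — its second column is a multiple of the first and its third is constant, so the intrinsic dimensionality is 1:

```python
import numpy as np

def explained_variance_ratio(X):
    # eigenvalues of the covariance matrix = variance along the principal axes
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending
    return eigvals / eigvals.sum()

X = np.array([[0, 0, 5],
              [1, 2, 5],
              [2, 4, 5],
              [3, 6, 5],
              [4, 8, 5]], dtype=float)
print(explained_variance_ratio(X))  # ~ [1. 0. 0.]
```

One direction carries essentially all the variance, so a dimension-reduction step could safely drop the other two before any (subspace) clustering is attempted.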

125. Explain the difference between global clustering, subspace clustering and
biclustering.

global clustering: all data points, all dimensions, (typically) non-overlapping clusters
subspace clustering: all data points, selected dimensions, overlapping subspaces
biclustering: selected data points and selected dimensions simultaneously,
overlapping clusters; applied to 2D (matrix) data, and the results are restricted to
rectangular shapes

All of them can be performed on categorical or numerical data, but global and
subspace clustering are mostly done on numerical data, while biclustering is done on
categorical data.

126. Select one example for a problem and describe the high-level design of a
visual analytics system to tackle it.

127. What is the general problem of grid-based approaches?

Grid-based approaches always depend on the grid size, and it is difficult to find a
fitting global grid size. To overcome this, one can use an adaptive grid size.
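The grid-size dependence is easy to demonstrate: the same points fall into a different number of occupied cells — and therefore yield a different cluster structure — depending on the chosen cell size. The toy points below are made up:

```python
from collections import defaultdict

def occupied_cells(points, cell_size):
    # assign each 2D point to its grid cell via integer division
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

points = [(0.1, 0.1), (0.4, 0.2), (1.1, 1.2), (1.4, 1.1)]
print(len(occupied_cells(points, cell_size=1.0)))   # 2 cells: two dense groups
print(len(occupied_cells(points, cell_size=0.25)))  # 4 cells: every point isolated
```

With a coarse grid the two natural groups appear as two dense cells; with a fine grid every point sits alone and no dense cell exists — exactly the sensitivity an adaptive grid size tries to avoid.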

128. Classify the following algorithms according to the categories below:

PCA, RIS, Projection Pursuit, DB-SCAN, SUBCLUE, Factor Analysis,
k-means, OPTICS, AHC, MDS, SURFING, Proclus, SNE, tree induction
algorithm, BIMAX, CLIQUE

Algorithms
[Classification table not included in this export]
Really cool visualization of clustering:
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
