VA - Exam Preparation
DISCLAIMER
This is NOT an official document provided by the lecturer or associates. It is also
not based on past exams. This means the questions can be well above or below the
difficulty of the actual exam and the wording can differ. The provided answers are not
necessarily correct. I thought of these questions (and the answers) myself and use
them to deepen my understanding of the topic. I just think they can be helpful to
others too.
General
1. What is Visual Analytics (VA)? Why would you use it? What is the main aim of
VA?
The data sets are very large and complex. This information overload means we
can get lost in data that might be irrelevant, or we process and present it in an
inappropriate way.
The main aim (or core objective) is to get insight into the data (hence the focus on
exploration), e.g. to understand trends, patterns and relations and to form new
hypotheses. We want effective understanding, reasoning and decision making.
Usability and performance are not the main requirements the techniques try to meet.
2. Name the core objectives for VA and give at least 1 example for each of them.
Trends: climate change, share price (German: Aktienkurs), infection rate with corona
virus
Patterns: day and night cycle in energy consumption, increased sales of
Christmas-themed goods in November and December, long-term climate cycles
3. For what kind of data is VA not useful and why? Give 2 examples and explain
why they fit into this category.
VA is not useful if the data is of high quality and the data size is moderate. This
means we do not need to gain new insight and can just visualize the data without
additional analytics. The insight will be obvious. An example is weather data where
we only have temperature and a date. We can put this into a simple time-temperature
diagram. Another example is the distribution of grades in a university course. We can
simply put it in a bar plot. This is easily interpretable and does not contain erroneous
data.
large data sets: crime statistics, census data, Amazon product database
inconsistent data: the same person registering multiple times for the same service, 2
sensors measuring the same thing but getting different results, results of a
non-deterministic algorithm run with the same parameters (e.g. due to random seeding)
5. What does Scalability and Reproducibility mean in the context of VA? Why is
it important to assess techniques according to those criteria?
Scalability: First, we have visual scalability: how well does the
visualization scale if we have a lot of data points? Scatterplots, for instance, can
hold hundreds of data points in one diagram, but a pie chart is limited to 10-15. It is also
closely related to issues such as overplotting and visual clutter. Then we have
display scalability, which means how well we can use the visualization on different
display sizes.
Reproducibility: the same data and parameter settings should lead to the same
result, so that insights can be verified and shared. (The slides say little here; this is
my own summary.)
Often we need to choose techniques that can support lots of data points because VA
deals mostly with large data sets.
Data: the raw input; it is cleaned, transformed and preprocessed.
Models: automatic analysis (e.g. data mining) builds models of the data.
Visualization: interactive visual representations of the data and the models for exploration.
Knowledge: the insight the user gains, which feeds back into both analysis and visualization.
7. If we only have raw data, we often need to perform certain operations on it.
How is this step called? What is its purpose? What do we gain from it?
This step is called preprocessing (data cleaning and transformation). Its purpose is
to improve the quality of the raw data: we can, e.g., remove erroneous entries and
interpolate missing data.
In this step we can also do analytical computations to obtain aggregated data (e.g.
summary statistics), generate derived data like rates of change, or create a hierarchy
(e.g. by clustering).
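As a sketch of such preprocessing steps (all function names are my own, not from the lecture), interpolation of missing values, aggregation into summary statistics and a derived rate-of-change series could look like this:

```python
# Illustrative preprocessing sketch; function names are my own, not from the lecture.

def interpolate_missing(values):
    """Fill None entries by linear interpolation (assumes known neighbours on both sides)."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = out[lo] + (out[hi] - out[lo]) * (i - lo) / (hi - lo)
    return out

def summary_stats(values):
    """Aggregated data: reduce raw values to a few summary statistics."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"n": n, "mean": mean, "min": min(values), "max": max(values), "var": var}

def rate_of_change(values):
    """Derived data: difference between consecutive measurements."""
    return [b - a for a, b in zip(values, values[1:])]

raw = [10.0, None, 14.0, 13.0]          # e.g. temperatures with one missing reading
clean = interpolate_missing(raw)        # [10.0, 12.0, 14.0, 13.0]
```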
For VA, we have a lot of tools at hand. Each of those tools has different parameters
and parameter combinations and can be combined with others or used alone. This
gives us great flexibility in choosing the right tools for the data at hand. But it also
means the analyst needs a lot of background knowledge about how those tools work
and how a parameter influences the result. Especially for high-dimensional data and
large data sets, it is not obvious which approach to choose.
We can therefore guide the analyst to help them choose a tool or a fitting parameter
combination. This can be done by automatically computing statistical scores of the
data, giving an overview of possible techniques, using estimated parameters based
on the data size, or giving an overview of the data.
Flexibility is very important, as there is no technique that performs well for every
data set and we need to give the opportunity to compare different techniques. Flexibility
also gives us the opportunity to interact with the data. Guidance often gives a good
starting point and should not be too intrusive.
10. Describe the Rank-by-feature framework by (Seo, 2004) [1]. Use the term
"interestingness" measure. What are interestingness measures for individual
dimensions and pairs of dimensions?
The ranking is based on some interestingness measures. (In the whole paper the
word "interestingness" is only present once. The ranking criteria they use are not
called interestingness measures, so I don't know if he wants the measures we learned
about in class or the ones from the paper. I will list the ones from the paper.)
[1] https://www.cs.umd.edu/hcil/hce/presentations/seo_shneiderman_rff_ivs.pdf
individual dimensions: Normality of the distribution, Uniformity of the distribution,
number of potential outliers, number of unique values, Size of the biggest gap
pairs of dimensions: Correlation coefficient, Least square error for curvilinear
regression, Quadracity, number of potential outliers, number of items in the region of
interest, Uniformity of scatterplots
11. Imagine that there are a couple of visual analytics systems, e.g. in finance or
for journalists. How can these systems be compared with each other? In other
words: How can we systematically evaluate visual analytics systems?
(I couldn't really find anything in the slides for this. So I'll just write what I
think is fitting. If anyone can point out the slide from the lecture, feel free to
edit).
VA systems should primarily be judged by how well they support producing insight
and hypotheses, not based on usability or performance. A VA system should be able
to use different techniques and combinations of those to give flexibility, but it should
also be able to give guidance.
Prof. Preim also said we can compare them on the basis of the questions asked, e.g.
what do we want to find out.
Global Clustering
12. What is Clustering? Why would we need clustering?
Clustering helps us understand the structure of the data. We can use it to partition
data and as a preprocessing step for classification or focus-and-context visualizations.
14. How would we decide what clustering method to use?
When the distribution of the data is unknown, we use clustering with different
methods and parameters until results are plausible. Selection of a clustering
method is based on assumptions about the data and a clustering model. The
clustering model can be chosen based on the distribution of the data, the expected
shapes of clusters and their relations.
15. What different kinds of clustering can we perform? Explain them in about 1
sentence.
A fuzzy clustering only gives us a percentage of how likely it is that an item belongs
to a certain cluster, while binary clustering partitions the data into non-overlapping
parts (hard clusters).
An ideal clustering method is scalable to many objects and dimensions, can deal
with clusters of arbitrary shape, is robust against noise and outliers and creates
plausible results.
19. What are the different clustering methods/paradigms? Explain their general
idea in 1-2 sentences and name an example for each of them.
distance model: Objects belong to a cluster i if they are closer to cluster center i
than they are to any other cluster center j. Example: k-means
density model: Objects belong to a cluster if their local density is higher compared
to the average density. Examples: DB-SCAN, OPTICS
hierarchy model: Clusters are assumed to exist at different levels. (We had no
named example, we only divided them into top-down (divisive) and bottom-up
(agglomerative) approaches.)
k-means: a clustering method based on a distance model
k-means partitions the data set into k groups; this means we have to
approximate the number of clusters beforehand. It tries to find k centroids,
one for each cluster, which minimize the distance of the associated data points.
Mathematically speaking, we want to minimize:
J = ∑_{i=1}^{k} ∑_{x_j ∈ S_i} ||x_j − μ_i||²
[2] https://stats.stackexchange.com/a/183213
[3] https://en.wikipedia.org/wiki/K-means_clustering#Discussion
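A minimal sketch of this minimization (Lloyd's algorithm, with hand-picked initial centroids; my own illustration, not the lecture's implementation):

```python
import math

def kmeans(points, centroids, iters=50):
    """Lloyd's algorithm sketch: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means. This only
    finds a local minimum of J; the result depends on the initial centroids."""
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: centroid = mean of the assigned points (keep old one if empty)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else old
                     for cl, old in zip(clusters, centroids)]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(pts, centroids=[(0, 0), (10, 10)])   # k = 2
```

With clearly separated groups like these, the centroids converge to the cluster means (1/3, 1/3) and (31/3, 31/3).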
d. What are the parameters we can adjust for k-means?
distance metric: euclidean distance can be used for numerical data, but we
could also use city block distance or something else
Neighbours are points at most a distance of ε away. If there are at least m points
density connected, they form a cluster. The other points are labeled outliers.
minimal number of points m: if this is too small, we potentially get a lot of
small clusters of points that should be outliers; if it is too high, we don't obtain
clusters that should be viewed as clusters
maximal distance ε : if too small, we only get clusters in very dense areas, if
too high, clusters get very big
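A compact DBSCAN sketch along these lines (my own simplified version; it counts a point in its own ε-neighbourhood):

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a point is a core object if its eps-neighbourhood
    (including itself) contains at least min_pts points; clusters grow from
    core objects. Returns one label per point, -1 meaning outlier/noise."""
    n = len(points)
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1                    # tentatively noise
            continue
        labels[i] = cluster
        seeds = list(neighbours[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise reachable from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                seeds.extend(neighbours[j])   # j is a core object, keep expanding
        cluster += 1
    return labels

pts = [(0, 0), (0.5, 0), (1, 0), (1.5, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.0, min_pts=3)      # two clusters, (50, 50) is noise
```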
e. In the following diagram, are A and B, A and D and A and C density
connected if the circles represent the maximal distance ε ?
A-B yes, A-C yes, A-D no
OPTICS
We have different linkage criteria that define when 2 clusters are merged. We
assume we have 2 clusters A and B.
single linkage: select pair of points (one from A, one from B) with minimum
distance to each other, merge clusters where this distance is minimal
sensitive to outliers, may result in long and thin clusters, supports arbitrary
shape, cannot separate clusters properly if there is noise between clusters
min{d(a, b) : a ∈ A, b ∈ B }
complete linkage: select pair of points (one from A, one from B) with
maximal distance to each other, merge 2 clusters where this is minimal
prefers spherical shapes, tends to split large clusters
max{d(a, b) : a ∈ A, b ∈ B }
average linkage: merge two clusters if the average distance of all pairs
(between those clusters) is minimal
(1 / (|A|·|B|)) ∑_{a∈A} ∑_{b∈B} d(a, b)
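The three criteria as small functions (a sketch; `math.dist` is the Euclidean distance):

```python
import math

def single_linkage(A, B):
    """Minimum over all pairs: the closest points of the two clusters."""
    return min(math.dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Maximum over all pairs: the farthest points of the two clusters."""
    return max(math.dist(a, b) for a in A for b in B)

def average_linkage(A, B):
    """Mean over all pairs, i.e. (1 / (|A|*|B|)) * sum of d(a, b)."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]
# single_linkage(A, B) == 3.0, complete_linkage(A, B) == sqrt(29)
```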
c. How does AHC perform based on our requirements for clustering
methods?
Hierarchical clustering can, for instance, be used to determine an
appropriate cluster number for algorithms like k-means. It reveals
similarity relations between clusters and provides a level-of-detail
extraction for clusters.
[5] https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
23. Mixed topics - Clustering
a. Most clustering algorithms need some kind of distance/similarity metric.
Give some examples of how such a metric can be designed. Consider
that the individual dimensions may strongly differ in their range (in the case
of scalar values) and in their data type, e.g. data may be categorical or
numerical.
We can define a distance metric for categorical data as well. E.g., we can say
the distance is 0 if objects have the same category and 1 if they have
different categories. Numerical dimensions should be normalized by their range
so that no single dimension dominates the distance.
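A Gower-style sketch of such a combined metric (my own naming, not from the lecture):

```python
def mixed_distance(x, y, kinds, ranges):
    """Distance for mixed data: 'num' dimensions contribute |xi - yi| / range,
    'cat' dimensions contribute 0 on a match and 1 otherwise; the result is
    the average over all dimensions, so every dimension lies in [0, 1]."""
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            lo, hi = rng
            total += abs(xi - yi) / (hi - lo)
        else:
            total += 0.0 if xi == yi else 1.0
    return total / len(x)

a = (180.0, "red")
b = (170.0, "blue")
d = mixed_distance(a, b, kinds=("num", "cat"),
                   ranges=((150.0, 200.0), None))    # (10/50 + 1) / 2 = 0.6
```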
d. Why is it generally a good idea to show more than one cluster result?
e. What is multi-clustering?
count, size, density, clumpiness, number of outliers, shape, centroid position
25. Imagine we have a very large dataset with 10 dimensions we have little to
no information about. What clustering algorithm would you suggest and why?
What preprocessing steps do you need to perform for your choice?
Another answer may choose k-means, because we have a large data set and high
dimensionality. k-means is one of the fastest methods. Noise and outliers can be
dealt with in preprocessing. k can be found in different ways:
Elbow method
"Using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in
mathematical optimization to choose a point where diminishing returns are no longer
worth the additional cost. In clustering, this means one should choose a number of
clusters so that adding another cluster doesn't give much better modeling of the
data." [6]
Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of
k, and choose the k for which WSS first starts to diminish. In the plot of
WSS versus k, this is visible as an elbow. [7]
[6] https://en.wikipedia.org/wiki/Elbow_method_(clustering)
[7] https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb
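A sketch of the WSS computation behind the elbow plot (here for hand-chosen partitions rather than actual k-means runs):

```python
import math

def wss(clusters):
    """Within-Cluster Sum of Squared Errors: squared distance of every point
    to the centroid of its cluster, summed over all clusters."""
    total = 0.0
    for cl in clusters:
        centroid = tuple(sum(c) / len(cl) for c in zip(*cl))
        total += sum(math.dist(p, centroid) ** 2 for p in cl)
    return total

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
wss_k1 = wss([pts])                   # one big cluster
wss_k2 = wss([pts[:2], pts[2:]])      # the two natural groups
# the big drop from k=1 to k=2 (and little gain afterwards) is the elbow
```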
Silhouette method
Calculate the silhouette coefficient for different numbers of k and choose the k
where it peaks.
Hierarchical Clustering
If we cluster hierarchically, we can step through the tree and visually decide what
depth is appropriate for the data set.
28. Given the following 2D datasets, which clustering method would you use to
cluster the data? Explain your answer.
(This also depends on the explanation although I would say it is often pretty
obvious which will give the best result)
a.
I would use OPTICS because we have 3 bars of different density. For
DB-SCAN it will likely be difficult to find a good global parameter and we
either have one big cluster or only the middle and the rest are outliers.
k-means might also work but only if the centroids are very well chosen.
(figures: k-means result, DB-SCAN result)
b.
I would use k-means with a parameter of k=7 because it looks like we have
about 7 circular clusters. DB-SCAN would probably assign everything to one
big cluster or have several small clusters as the density is slightly varying, eg
in the middle.
(figures: k-means result, DB-SCAN result; it is possible that k-means does not
converge to this solution though)
c.
I would use DB-SCAN because it looks like we would obtain 4 clusters of
higher density and the rest of the points are outliers (noise). k-means cannot
detect the ring cluster, because it does not allow non-convex / split clusters.
(figures: k-means result, DB-SCAN result)
Subspace Clustering
29. Why would we do subspace clustering instead of global clustering?
Global clustering is limited to about <= 10-15 dimensions. The higher the
dimensionality, the sparser the data becomes. Sparse data has no or hardly any
clusters. The data is only similar in some dimensions. These dimensions form a
subspace. Also, noise, irrelevant data and highly correlated dimensions reduce the
quality of global clustering.
30. What is the difference between subspace search and subspace clustering? Is
there an advantage of one over the other? Explain your answer.
(I think this is the same as the question: "Subspace clustering can be realized
in an integrated manner or in a decoupled manner. Explain these two
approaches." from the example questions)
Subspace search only searches for eligible subspaces, it does not perform
clustering in those subspaces. Subspace clustering does both in one algorithm.
Subspace search has an advantage. It is more flexible, less biased (as clustering
often works with assumptions) and more effective, since uninteresting subspaces are
filtered before the clustering step.
A subspace is not clusterable if the points have the same distance to all of their
neighbours. It is clusterable if there are differences in density. (I think SURFING
used this as its measure.)
33. What heuristics can we use to prune the search? Why is this useful?
34. Name and describe a subspace search algorithm.
A dense region is formed by a core object and all objects that are connected to it.
We count all objects in dense areas in the subspace S: COUNT(S).
The interestingness is then the number of objects in dense areas divided by the
volume of the subspace:
Interestingness(S) = COUNT(S) / VOLUME(S)
We can then rank the subspaces according to this interestingness measure. Higher
interestingness means higher relevance.
35. RIS
The following picture represents a 2D subspace.
[8] https://www.dbs.ifi.lmu.de/Publikationen/Papers/PKDD03-RIS-final.pdf
[9] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.5588&rep=rep1&type=pdf
a. Name all core objects when the circles represent the ε -Neighbourhood
of the objects and minPoints = 5. How many core objects would we
have if minPoints = 4?
36. What are the paradigms we can divide subspace clustering into? Name an
example algorithm for each of them. Explain one in detail.
density threshold, biased towards lower dimensions
37. CLIQUE
Which cells would belong to a cluster if minPoints = 5?
A1, C1, C2
We can transfer categorical data into numerical data or use frequent item set mining.
We need a special normalization for categorical, continuous and hybrid data. Some
methods for categorical data use density and frequency estimation.
To evaluate the clusters, we can use an artificially created data set to test if
algorithms return known clusters, but we also need to test real-world data as a
benchmark (as real-world data is often not as well behaved as artificially created
data). The artificial data should also be of similar size and dimension as the
real-world data.
We can also evaluate the clusters based on cluster evaluation criteria like purity.
For real-world data, it might be useful to get feedback from experts regarding
plausibility. A cluster should represent a concept and have relevance for decision
making.
We can also check how the clusters change if we change parameters of the algorithm.
Subspace clustering can deal with high dimensions, but more than 100 dimensions
are too much information to be explored efficiently. We can push this limit by
including background information about the data in our subspace search (constrained
clustering). This is more scalable and improves accuracy. Clustering algorithms
only identify dense areas. Their primary goal is not to identify correlated
dimensions, patterns or outliers.
Visualization
41. What needs to be visualized to convey results of subspace clusters? Which
information you would consider important for an overview and which
information could be revealed on demand as interesting details?
42. Why do we need a visualization for subspace clustering especially? What are
some challenges that arise in subspace clustering visualization?
In the visualization, we want to see non-redundancy (we want clusters to not have
too many overlapping dimensions and not too many overlapping instances), coverage
(are most instances and dimensions part of a cluster?) and cluster characteristics
(size, compactness, dimensions involved).
parallel coordinates, scatter plots, heat maps, linked views, ClustNails
46. Name one advantage and one disadvantage (each) for visualizing subspace
clusters with a heat map and parallel coordinates.
[10] https://scibib.dbvis.de/uploadedFiles/Hundetal2016Visualanalyticsforconceptexplorationinsubspa.pdf
From the paper mentioned for heat maps: "The MDS projection, however, can distort
the perception of similarities as in many scenarios there is no optimal 2D
representation of all pair-wise similarities which results in perceivable patterns which
are not given in the underlying data."
parallel coordinates: can show clusters of subspaces with colored lines in a parallel
coordinate view.
Advantage: contributing dimensions and members of a cluster are well visible.
Disadvantage: parallel coordinates can become cluttered very fast, and the visual
result depends on the ordering of the dimensions.
48. How can we visualize the relations between subspaces?
We can visualize the relations between subspaces with trees and graphs, but those
do not scale well to many subspaces. They may, however, be useful for zoomed in
portions (detailed views).
We can have a lasso selection, view detailed information about the involved
dimensions (after selection), store relevant results, and brush to other (similar)
subspaces. An experienced user can add/remove dimensions.
Cluster Validation
51. By what means can we rate the quality of a cluster? What failure cases can
arise for clusters?
(Alternative formulation: "How can clustering results be evaluated? Consider
qualitative and quantitative aspects.")
A cluster can contain false positives (objects assigned that should not be part of a
cluster) and false negatives (objects not in the cluster that should be). False
positives render the cluster less meaningful as they do not contribute to the concept
the cluster should represent. False negatives mean that an essential structure was
not detected.
Silhouette coefficient
We have the cluster A and several clusters B_i, and objects a ∈ A and b ∈ B_i. We
then compute the average distance d_a = d(o, A) = 1/(|A|−1) ∑_{a∈A} d(o, a) of
o ∈ A to the other objects in A and the minimum average distance
d_b = min{ d(o, B_i) = 1/|B_i| ∑_{b∈B_i} d(o, b) | B_i ≠ A } to every other cluster B_i.
The silhouette of an object is then S(o) = (d_b − d_a) / max(d_a, d_b).
The coefficient for the clustering is then
SC = (1/n_C) ∑_{c∈C} ∑_{o∈c} S(o), where C is the set of clusters and n_C is the
number of all objects in all clusters.
(The slides are a bit confusing but are supposed to match the definition on Wikipedia
https://en.wikipedia.org/wiki/Silhouette_(clustering) )
centroid-based measure
For this method we compute the centroids of a cluster and compare the distance of
the cluster elements to the centroid of their own cluster to the distance to the centroid
of all other clusters. A centroid is the average of the positions of points in a cluster.
So c_i is the centroid of cluster i, and p_i ∈ C_i is a point in cluster i.
If dist(p_i, c_i) < dist(p_i, c_j) holds for all points p_i and all other clusters j, the
clusters are perfectly separated. For a cluster obtained by k-means, this should
automatically be the case.
The portion of points where this inequality doesn't hold indicates the cluster quality.
To obtain a measure for the whole clustering result, we compute a sum weighted by
cluster size.
A downside of this measure is that split or interwoven clusters give low values.
Narrow and curved clusters also produce low values, while convex, compact shapes
give high values. The method is relatively robust against different sizes and
densities of clusters.
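A sketch of this measure as the (unweighted) fraction of points closer to their own centroid than to any other centroid (my own simplification of the weighted sum described above):

```python
import math

def centroid_separation(clusters):
    """Fraction of points that satisfy dist(p, own centroid) < dist(p, any
    other centroid). 1.0 means perfectly separated clusters."""
    centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    ok, total = 0, 0
    for i, cl in enumerate(clusters):
        for p in cl:
            own = math.dist(p, centroids[i])
            other = min(math.dist(p, c) for j, c in enumerate(centroids) if j != i)
            ok += own < other
            total += 1
    return ok / total

compact = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # well separated clusters
split = [[(0, 0), (10, 0)], [(5, 0), (5, 1)]]        # a split/interwoven case
# compact scores 1.0, while split scores only 0.75
```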
53. Given the following distance matrix for objects ai ∈ A, bi ∈ B and A and B
being clusters, compute the silhouette coefficient.
a1 a2 a3 b1 b2 b3
a1 0 1 2 3 6 5
a2 1 0 3 4 5 8
a3 2 3 0 6 7 6
b1 3 4 6 0 2 1
b2 6 5 7 2 0 1
b3 5 8 6 1 1 0
d(a1 , A) = (1 + 2)/2 = 3/2 , d(a2 , A) = (1 + 3)/2 = 2 ,
d(a3 , A) = (2 + 3)/2 = 5/2 , d(a1 , B ) = (3 + 6 + 5)/3 = 14/3 ,
d(a2 , B ) = (4 + 5 + 8)/3 = 17/3 , d(a3 , B ) = (6 + 7 + 6)/3 = 19/3
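Continuing the exercise in code (a sketch using the Wikipedia definition of the silhouette cited below; cluster ids and variable names are my own):

```python
def silhouette(dist, clusters):
    """Silhouette coefficient from a distance matrix. `clusters` maps a
    cluster id to the list of object indices belonging to it."""
    scores = []
    for cid, members in clusters.items():
        for o in members:
            # a: average distance to the other objects in the own cluster
            a = sum(dist[o][x] for x in members if x != o) / (len(members) - 1)
            # b: minimum average distance to any other cluster
            b = min(sum(dist[o][x] for x in other) / len(other)
                    for k, other in clusters.items() if k != cid)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# the distance matrix from the question (order a1, a2, a3, b1, b2, b3)
D = [[0, 1, 2, 3, 6, 5],
     [1, 0, 3, 4, 5, 8],
     [2, 3, 0, 6, 7, 6],
     [3, 4, 6, 0, 2, 1],
     [6, 5, 7, 2, 0, 1],
     [5, 8, 6, 1, 1, 0]]
sc = silhouette(D, {"A": [0, 1, 2], "B": [3, 4, 5]})   # ~0.70, a decent clustering
```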
54. Given the following average distances for objects a, b, c coming from clusters
A, B, C respectively, compute the silhouette coefficient. Based on the
silhouette coefficient, is the clustering good?
c3 18 17 10
The clustering is not very good. The value is not close
to 1.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping
clusters. Negative values generally indicate that a sample has been assigned to the
wrong cluster, as a different cluster is more similar.
55. Given the following images, would a centroid based measure give a good or
bad purity measure for this clustering?
(three clustering images: 1, 2, 3)
56. "We only need one purity measure to judge the quality of our clustering."
Do you agree with this statement? Justify your answer.
This statement is not true. A measure often prefers certain cluster structures or
depends on chosen parameters. A centroid based measure for instance performs
poorly for split clusters although they may be well separated. A grid based measure
would say they are fine, given the grid size was chosen accordingly. So one measure
does not fit every case which is why we want to look at multiple. If all say the result is
poorly clustered, we have a much higher certainty this is true than if only one says it
is poorly clustered.
a.
silhouette-coefficient: I'd say it is pretty good. The clusters are obtained by
k-means and are nicely separated. The intracluster distance should be good.
centroid-based: Same as with the silhouette coefficient. The clusters are obtained by
k-means, so they are (locally) optimal regarding their centroids.
grid-based: Should give a good result. We have only a few cells that have
points of different clusters in it.
qualitative: The clustering looks good and seems to capture the underlying
pattern of multiple circles well.
b.
silhouette-coefficient: I'd say the silhouette coefficient does not give a good
result, mainly because the red points (the ring) are many and nearly always
have another cluster closer to them.
centroid-based: The centroid based measure would give a bad result
because of the red points. Nearly all of them are closer to another centroid.
grid-based: The grid-based measure would give a perfect result as we have
no mixed cells.
qualitative: I'd regard this clustering as good because it reveals the pattern of
a smiley-face very well and also regards outliers.
c.
silhouette-coefficient: I'd say the silhouette coefficient is not good for this
example. The clusters are close to each other and there are often quite small
ones, so the points at the cluster edges might not be closest to their own
cluster.
centroid-based: I'd say the centroid based method gives an ok result. The
clusters are roughly convex and the points classified as outliers give some
spacing.
grid-based: The grid-based measure gives a bad result for the clustering. We
have a lot of mixed cells. A smaller grid would improve the measure.
qualitative: I would regard this as a bad clustering result. The clusters do not
reveal the underlying circle structure, outliers seem random and plenty.
A larger grid size in b would lead to mixed cells and therefore a bad result for
the measure. This would mean all of the measures we used would say we
have a bad clustering. Still, if we look at the clustering visualization in the
scatterplot, we can see that the clustering is actually good and probably what
we would expect. So even if the quantitative measures fail, it does not mean
the clustering is necessarily bad. The visualization helps us to see that.
Cluster Visualization
58. What are the tasks we want to perform when visualizing clusters?
For a 2D clustering, we also want to show the distribution and statistical properties
per cluster, indicate which point belongs to which cluster and allow user selection. A
fuzzy clustering should also include the probability of a point belonging to a cluster.
For high-dimensional data we want to preserve the distances in a projection so
clusters are perceived as such.
distance matrix
A distance matrix is an early technique. We calculate the distance of each object to
each other object and reorder the matrix based on those values. The clusters should
become visible.
[11]
glyphs
We can use glyphs to represent certain aspects. For instance, we can use color or
different shapes to represent membership of a cluster. A glyph should be consistent
and be perceived as similar if the underlying data is similar. Glyphs can be combined
with other visualizations such as scatterplots.
[12]
scatterplot
A scatter plot representation often requires some kind of projection to 2D or 3D. The
visualization depends on parameters of clustering and the projection technique used.
The distortion introduced through projection can make patterns appear that are not
inherent to the data. An animated scatterplot may serve well for a temporal
component.
[13]
parallel coordinates
Dimensions are displayed as vertical lines; data entries are curves that intersect the
dimension lines at the position that represents their value. Clusters can e.g. be color
coded. The visual presentation can be enhanced by varying opacity and by edge
bundling.
[11] https://upload.wikimedia.org/wikipedia/commons/7/7a/Distance_matrix.PNG
[12] https://res.cloudinary.com/practicaldev/image/fetch/s--hmMb2h34--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/720/1%2ATEYPlUQfggUVnqu26QiQ3g.png
[13] https://www.researchgate.net/profile/Xingwang_Zhao/publication/220604252/figure/fig2/AS:746855587651585@1555075656855/Scatter-plot-of-the-ten-cluster-data-set.png
[14]
isolines
Isolines can be used to show the degree of membership in fuzzy clustering.
enhanced dendrograms
Dendrograms can be used for hierarchical clustering. They summarize grouping. A
big downside is that they do not scale well and do not enable verification if a cluster
makes sense. An enhanced version can also take histograms of clusters into
account.
3D hierarchy visualization
These are also for hierarchical clustering. The clusters can be displayed as
semi-transparent nested surfaces.
Outlier Detection
60. What characterizes an "Outlier" in the data? Why do we want to detect them?
An outlier is a data point that deviates strongly from all of its neighbours. They
are the extrema in a data set. What this means is highly context dependent. In high
dimensions, they are hard to identify because data is sparse anyway.
On the one hand, outlier removal can be beneficial for the analyst. Some algorithms
cannot deal with outliers and produce wrong or counterintuitive results. Outliers in
measured data are also often errors and do not convey useful information, so the
quality of the data gets better if we remove them.
[14] https://upload.wikimedia.org/wikipedia/en/4/4a/ParCorFisherIris.png
On the other hand, outliers can be interesting points that do not stem from noisy
measurements. In that case, they can give new information. Imagine a person being
immune to a disease. They are definitely an outlier in the data but can give
information on what confers immunity (e.g. gene defects). So if we remove them, we
lose valuable information.
This means we have to consider removing them very carefully, based on
assumptions about whether they belong in the data or not and on the insight we want
to get. For example, the immune person might be interesting for finding a cure but
not for evaluating how a disease spreads.
depth based
Depth based methods define outliers as the boundary of a distribution. This
method uses the convex hull to identify outliers. Points on the convex hull (depth 1)
are most likely outliers; we therefore remove them.
distance based
Distance based methods define outliers as having an abnormal distance to their
neighbours: a point is an outlier if it has fewer than α% of all points in its
ε-neighbourhood.
density based
Density based methods define outliers as having abnormally low density in their
area. Simple techniques do not detect outliers reliably if clusters having different
densities are not clearly separated.
cluster based
Cluster based methods define outliers as not being part of a cluster. Generally,
clustering algorithms are good at finding clusters, not outliers, so this should be used
with care. A lot of similar outliers could be considered a cluster.
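The distance based definition is the easiest to sketch in code (α as a fraction; names are my own):

```python
import math

def distance_outliers(points, eps, alpha):
    """Distance based outlier detection: a point is an outlier if fewer than
    alpha (a fraction, e.g. 0.5 for 50%) of all points lie within distance eps."""
    n = len(points)
    outliers = []
    for p in points:
        # count the other points inside the eps-neighbourhood of p
        inside = sum(1 for q in points if q is not p and math.dist(p, q) <= eps)
        if inside < alpha * n:
            outliers.append(p)
    return outliers

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (20, 20)]
# distance_outliers(pts, eps=2.0, alpha=0.5) -> [(20, 20)]
```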
63. In the following image, what points are outliers according to depth based
outlier detection?
the outer 3
65. What is the main criterion for validating outlier detection algorithms?
precision: If we have ground truth where we know how many outliers we have, we
can test if an algorithm gives us the same number and the same objects. If a dataset
(with M objects) has N outliers and an algorithm returns the top-N outliers, we
compute the size of the intersection divided by N. If N << M, precision is likely low,
whereas for N closer to M it is better (with the same algorithm).
The intersection contains the outliers that our algorithm found and that are also
present in the ground truth. If they are 100% the same, our algorithm is very precise.
The probability of our algorithm finding the same outliers is higher if more outliers are
present. An algorithm will have it much harder if only 3 of 10,000 points are
ground-truth outliers than if 300 of 10,000 points are outliers. So to judge the
precision, we have to consider the ratio of N to M. For comparability between
algorithms, we consequently have to choose the same conditions (preferably the
same data set and parameters).
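As a formula sketch (object ids stand in for data points; the example numbers are made up):

```python
def outlier_precision(reported, ground_truth):
    """Precision of a top-N outlier ranking: |reported ∩ ground truth| / N,
    where N is the number of true outliers and the algorithm reports N objects."""
    return len(set(reported) & set(ground_truth)) / len(ground_truth)

# hypothetical example: the algorithm finds 2 of the 3 true outliers
p = outlier_precision(reported=[3, 17, 42], ground_truth=[3, 42, 99])   # 2/3
```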
Biclusters
66. What are Biclusters? What do they represent? What properties do they have?
This graph represents relations between two types of nodes, e.g. people associated
with organizations. Biclustering is often done on categorical data. A biclustering
algorithm can produce a massive amount of biclusters (similar to subspace
clustering). Those biclusters are often redundant or overlap.
67. To what type of data can biclustering be applied and which results occur?
Give an example for data to which biclustering can be applied?
Yes, biclusters can overlap, because it is not the whole subspace that is clustered but only
a subset of the subspace. These subsets can overlap for different biclusters.
69. What pre- and post processing might we perform when biclustering?
Postprocessing includes removing biclusters that are too small, filtering for clusters
with high quality and ranking them according to size.
70. Algorithms
Chained biclusters are biclusters that are related to each other by their rows or
columns. An example would be people related to organizations and organizations
related to locations.
Visualization
73. What tasks do we want to perform with a bicluster visualization?
74. What are requirements for a bicluster visualization? What are parameters for
initial generation?
Parameters for the initial generation or filtering of a biclustering are the (percentage)
minimum/maximum number of rows and columns, the noise threshold for
numerical data and the maximum overlap.
table based:
The table based approach shows rows and columns in a table. For up to 2 biclusters,
we can simply reorder the rows and columns to have a continuous representation of
the cluster. For >2 overlapping biclusters, we need duplicated rows/cols.
[15] https://www.researchgate.net/figure/Chaining-four-biclusters-through-multiple-relations-by-approximately-matching-sets-of_fig1_301697851
Duplications should be marked and do not scale well. Several algorithms compute
layouts that minimize duplications.
We can color biclusters with an individual color, but this does not scale well either.
Interaction in table based approaches includes selecting biclusters as focus or for
highlighting, sorting, labeling, zooming and enabling/disabling duplicates.
parallel coordinates:
In parallel coordinates, we can color lines by bicluster, but this approach does not
scale well and we have a problem with overlapping biclusters. We can also link the
visualization to a table based approach.
graph display/node-link:
A node-link representation can be used for displaying chained biclusters. They
represent a m:n: … :z relation, eg. the relation between patient, disease and
treatment. Each node is a bicluster and each edge is the link between biclusters.
Link width can be scaled by frequency.
set based:
Set based approaches show the biclusters as subsets of bipartite graphs (remember
the definition). Naive representation has high visual clutter, which can be reduced by
inserting an abstraction in between. The abstraction can also be used to show
additional info, like how many items are involved. We can also use the length of the
bundle to show the size of the bicluster or small rectangles in the item list view
represent frequency in the dataset (e.g. in text documents, where words/names may
occur multiple times).
Interaction in set based approaches can be sorting and ordering and interactive
placement of the elements.
BicOverlapper
Overlap between biclusters can also be displayed by transparent regions, which scales
better than a node-link diagram. We can search for nodes, highlight connections, fix
node positions and navigate through the graph.
A multi-class scatterplot is a scatterplot that also shows the class of the data
points (eg what cluster they belong to). For instance, we show the relation between
height and weight in a scatterplot and color the points according to whether a person
is male or female.
analyze correlations
detect outliers
analyze clusters
comparison of data on the same axes
79. Discuss overplotting and visual clutter and how it can be reduced for
scatterplots.
metrics of visual clutter: screen space statistics (number of used pixels, number of
free pixels, collisions), item number, redundancy, grouping, contrast, saliency
(standing out)
distortion: We can use distortion to reduce clutter by increasing screen space for
dense regions and decreasing screen space for sparse regions. Distorted views are
more difficult to interpret than undistorted visualizations → we need interaction to
interpret them, and the distortion should be adjustable.
80. Is linear scaling always the best to use? Discuss. What else do we have to
look out for when scaling the scatter plot? Sketch examples for badly scaled
scatter plots.
Linear scaling of the axes makes the data set easiest to interpret. But linear scaling
can lead to large sparse regions, so to use the screen space efficiently, we can also
scale axes differently. Log-scaling or square root scaling are also used often and are
understood well enough.
We also have to consider outliers. They can cause the scatterplot to scale awkwardly
if they are far away from every other data point. Generally, a scatterplot should be
scaled in a way that the data has some distance from the frame but isn't crammed in
a proportionally small part of the domain shown.
3D scatterplots show a distribution of 3 variables. They try to exploit the human visual
system, which understands spatial relations well, but the spatial relation of points is
very hard to perceive as we are missing depth cues. 3D scatterplots also suffer
from occlusion problems and mentally demanding interaction. That's why it is
mostly preferred to show the combination of each axis pair separately.
82. Explain the idea of SPLOM and GPLOM. What does a GPLOM do better?
Where do you see scalability limits?
(related question: Imagine that your data is mixed with some dimensions
being numerical and others categorical. How can the scatterplot matrix be
extended to display such data.)
[16] https://miro.medium.com/max/1400/1*C-BCaajZWvSAujSWeBAZBQ.png
[17] https://www.researchgate.net/profile/Jean_Francois_Im/publication/256837289/figure/fig4/AS:409148761100291@1474560075349/Example-plots-extracted-from-a-matrix-generated-with-the-gpairs-package-in-R-7-Top.png
83. Explain 2 advanced scatter plot techniques.
Concentration Ellipses
Concentration ellipses are an abstract depiction of a multiclass distribution. Multiple
semi-transparent ellipses with solid borders are fitted to the data points of each class.
Outliers may be excluded. To properly explore the data, interaction is needed, as the
ellipses are only an abstract depiction.
Binning
With binning, we abstract the data at hand. Multiple points are aggregated into
groups.
There are several variants of a binned scatter plot. We can eg bin by the groups
used in histograms, by a rectangular grid of the domain (discrete density plot), by a
hexagonal grid (hexbin) on the domain or by adaptive ranges. The grid based
approaches are also called density plots.
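A minimal sketch of grid-based binning with numpy (the hexagonal variant works analogously, e.g. via matplotlib's hexbin); the data is synthetic:

```python
import numpy as np

# Grid-based binning (discrete density plot): aggregate points into a
# rectangular grid and keep only a count per cell.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

counts, xedges, yedges = np.histogram2d(x, y, bins=20)

# Every point falls into exactly one cell, so the counts sum to the number
# of points; a heatmap of `counts` is the binned scatter plot.
print(counts.sum())  # 1000.0
```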
Splatter Plot [22]
Splatter plots are similar to concentration ellipses but they can have arbitrary shape.
They are based on a density calculation over the whole domain. The edge of the
shape is an isoline matching a density threshold. The colors for data points are
blended based on density. The resulting polygons are smoothed. To avoid distracting
outliers, sparse regions are filtered and sampled.
[18] https://lh3.googleusercontent.com/proxy/CDCpdo7ae1el2fqujpM9zpPHmM_6qvNnSq4wdr9-42Sl7ZOTerKMq2bBpnqjO-LCRUKNwZ3sW5moTnJAYWCABNdCnSxV55Re1xAUkA-62OPPovDWd4-y5qF6VUpgeCkJgL1SKLJniHt8MrM8ZiBttsmcLqm4dxcN_utOhUM_LSUd3hU8qRwpOmXW75uylA
[19] https://doc.dataiku.com/dss/latest/_images/grouped-scatter.png
[20] https://www.mathworks.com/help/examples/matlab/win64/BinScatterPropertiesExample_01.png
[21] https://datavizproject.com/wp-content/uploads/2015/11/Sk%C3%A6rmbillede-2016-01-28-kl.-10.56.25.png
[22] https://graphics.cs.wisc.edu/Papers/2013/MG13/splatterplots-final.pdf
Generalized scatterplot
A generalized scatter plot deals with distortions. The user can move a slider to go
from an undistorted view that may contain overlapping and overdrawing to a
distorted but overlap-free view. Here, the interaction is essential to understand the
data.
The algorithm to obtain the distorted view follows an iterative approach. Each data
point is added one after another. If the space where the point is supposed to go
already contains another point, the new point is displaced a little bit. We do this until
all points are added.
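The iterative displacement idea can be sketched roughly as follows (hypothetical and heavily simplified: a pixel grid and a ring search for the nearest free cell; the actual algorithm differs in detail):

```python
# Sketch: add points one by one; if the desired grid cell is occupied,
# displace the new point to the nearest free cell.
def place_points(points, width, height):
    occupied = set()
    placed = []
    for (x, y) in points:
        # search rings of growing radius around the desired cell
        for r in range(max(width, height)):
            ring = [(x + dx, y + dy)
                    for dx in range(-r, r + 1)
                    for dy in range(-r, r + 1)
                    if max(abs(dx), abs(dy)) == r]
            free = [c for c in ring
                    if c not in occupied
                    and 0 <= c[0] < width and 0 <= c[1] < height]
            if free:
                cell = min(free, key=lambda c: (c[0] - x) ** 2 + (c[1] - y) ** 2)
                occupied.add(cell)
                placed.append(cell)
                break
    return placed

# three identical points end up in three distinct cells
print(place_points([(5, 5), (5, 5), (5, 5)], 10, 10))
```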
87. Describe measures to compare two or more scatterplot-based
representations.
High dimensional data is hard to analyze and visualize. Eg. a 10D data
set would result in 45 scatter plots in a SPLOM. In high dimensions, it is also
nearly impossible to find global clusters.
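The quadratic growth of a SPLOM can be checked quickly; the 45 comes from choosing 2 of 10 dimensions (the dimension names below are placeholders):

```python
from itertools import combinations

# Number of pairwise scatter plots in a SPLOM for d dimensions: d*(d-1)/2.
dims = [f"dim{i}" for i in range(10)]   # a hypothetical 10D data set
pairs = list(combinations(dims, 2))
print(len(pairs))  # 45
```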
feature selection: remove dimensions and only keep a subset of the original
dimensions
scientists are not satisfied
Progressive DR is part of progressive visual analytics and is based on the idea that
the user may start the analysis on intermediate results in case the full
computation takes too long. So, PDR reduces only a subset of the points at first.
The user can then already adjust parameters in case the projection is poor without
waiting for the full result. This speeds up the process of analyzing.
[23] https://nicola17.github.io/publications/2016_AtSNE.pdf
Linear Techniques
93. What is the idea of linear dimension reduction (LDR) techniques? What data
is suitable for those methods?
LDR techniques generate a new set of dimensions where each new dimension is
a linear combination of the original dimensions. LDR is suitable for normally
distributed data. It does not scale well for very HD data and has a limited degree of
freedom.
score plot: scatterplot of the data for the largest 2-3 PCs
[24] https://setosa.io/ev/principal-component-analysis/
scree plot [27]: shows the variance (eigenvalues) for each PC. The scree plot is
missing a lot of information, eg the influence of the original variables on the
PCs.
[25] https://support.minitab.com/de-de/minitab/18/principal_components_loan_applicant_score_plot.png
[26] https://support.minitab.com/de-de/minitab/18/principal_components_loan_applicant_loading_plot.png
[27] https://upload.wikimedia.org/wikipedia/commons/a/ac/Screeplotr.png
analysis. It is also challenging to interpret and needs preprocessing.
PCA
PC1 = a * dim1 + b * dim2 + c * dim3 + …
FA
dim1 = x * factor1 + y * factor2 + z * factor3 + ...
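A minimal PCA sketch in plain numpy (synthetic data, no sklearn assumed): the eigenvalues are what a scree plot shows, the projected scores are what a score plot shows.

```python
import numpy as np

# Synthetic 3D data where dim2 is strongly correlated with dim1,
# so one PC should capture most of the variance.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort PCs by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                    # data in PC coordinates (score plot uses PC1/PC2)
explained = eigvals / eigvals.sum()      # variance share per PC (scree plot shows these)
print(explained.round(3))
```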
In PP, each dimension is associated with an index that describes how well
the dimension separates the data. The resulting projection reveals clusters
very well. It can also be applied to individual clusters (eg results of global or
subspace clustering).
[28] https://towardsdatascience.com/interesting-projections-where-pca-fails-fe64ddca73e6
b. Sketch (in the sense of roughly describing) how the algorithm works.
Yes, it preserves clusters very well and is a powerful tool for exploratory data
analysis. It also is robust against noise and efficient.
Non-linear Techniques
97. What is the idea behind non-linear projection techniques?
Non-linear techniques allow more degrees of freedom and are also suitable for
skewed and multimodal distributions. They try to preserve small distances,
because small distances are often more interesting than large distances.
We can use MDS for text analysis, force directed layouts or to show results of
subspace clustering.
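A sketch of classical (metric) MDS, assuming plain numpy and Euclidean input distances: the embedding comes from an eigendecomposition of the double-centered squared-distance matrix.

```python
import numpy as np

# Classical (metric) MDS: recover k-dimensional coordinates from a
# pairwise distance matrix D.
def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double centering
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]    # top-k eigenpairs
    L = np.sqrt(np.maximum(eigvals[order], 0))
    return eigvecs[:, order] * L             # n x k embedding

# toy check: for points on a line, the 1-D embedding preserves all distances
pts = np.array([[0.0], [1.0], [3.0]])
D = np.abs(pts - pts.T)
emb = classical_mds(D, k=1)
D_emb = np.abs(emb - emb.T)
print(np.allclose(D, D_emb))  # True
```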
c. What is the advantage of the variant "progressive steerable MDS"?
SNE has the problem that it projects a lot of points into the center. This leads
to crowding. t-SNE uses a Student-t distribution instead of a Gaussian and
evens out the density. It therefore solves the crowding problem.
SNE and t-SNE preserve clusters. t-SNE expands dense clusters and
contracts sparse clusters, so it is not suitable to compare relative cluster
sizes and relative distances between clusters.
Projection Quality
101. Error
a. By what means can we analyze the projection error?
How is the error spread?
Where are the points located that have a high difference in distance between
projection and HD?
How does parameter choice affect the error?
A lot of DR techniques optimize some error function → can use those as error
metric. Eg preservation of distances, overlap between k-nearest neighbours,
agreement in ranking between k-nearest neighbours, correlation of pairwise
distances
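The k-nearest-neighbour overlap metric mentioned above can be sketched as follows (brute-force distances, plain numpy; a toy measure, not a specific published one):

```python
import numpy as np

# Neighbourhood preservation: overlap of k-nearest-neighbour sets before
# (high-dimensional) and after (projected) dimension reduction.
def knn_sets(X, k):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)             # a point is not its own neighbour
    return [set(np.argsort(row)[:k]) for row in D]

def knn_preservation(X_hd, X_ld, k=5):
    hi, lo = knn_sets(X_hd, k), knn_sets(X_ld, k)
    return np.mean([len(a & b) / k for a, b in zip(hi, lo)])

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
perfect = knn_preservation(X, X.copy())     # identical data → full overlap
print(perfect)  # 1.0
```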
A visualization should provide cues about local reliability in the structure directly. We
can show it per point or per region (eg as color in the background of the scatter plot).
The type of error should be conveyed as well.
Examples are Voronoi cells (their drawback is that they also show error in areas without
points), the false neighbours and missing neighbours views [29] and stress maps [30].
104. What is the problem when the user wants to compare different techniques
interactively? How can we decrease it?
The user might want to explore and compare different DR techniques with multiple
parameters. Some DR techniques are very slow, so interactivity is difficult
performance-wise. We can still provide interactivity by using approximative
solutions, multigrid and multisolver techniques, and GPU support.
[29] https://www.semanticscholar.org/paper/Visual-analysis-of-dimensionality-reduction-quality-Martins-Coimbra/9e7105b77093b1947637040034b0d7b80ab35a20
[30] https://www.semanticscholar.org/paper/Stress-Maps%3A-Analysing-Local-Phenomena-in-Reduction-Seifert-Sabol/15ecef22524c2cb650fed2357e0d0b7feefe1625
I guess the next 2 questions answer this as well.
DR is a very automated process. The user may choose the algorithm and some of its
parameters (eg what distance metric to use for MDS, or perplexity and noise for SNE).
107. How does dimension reduction through user defined quality metrics work?
What quality metrics could we use? Explain 2.
We can for instance use outlier preservation rate as an error metric. The algorithm
will try to preserve outliers when reducing dimensions. For each pair in several
dimensions (2D, 3D, …) it is recorded for which dimension selection a point is
considered an outlier.
Another one is cluster dominance. The algorithm tries to remove only dimensions
that do not contribute to subspace clustering. Loss of information relates to the
dimensions removed that contribute to clusters.
"Seek a view" helps the user to find interesting subspaces. For outliers, negative
correlations and other things, more than one subspace can be relevant.
It is a combination of visual representations (views) used to interpret subspaces.
Filtering enables reduction for iterative refinement.
[31] https://www.researchgate.net/publication/220778426_Visual_Hierarchical_Dimension_Reduction_for_Exploration_of_High_Dimensional_Datasets
109. Why is guidance so important for dimension reduction tasks?
111. What are the components of a Decision Tree? Make a sketch of a DT and
label them.
112. Algorithm
a. Explain steps of the algorithm to build a decision tree. (Other
formulation: Explain the basic tree induction algorithm.)
[32] https://eprints.cs.univie.ac.at/4209/1/vast10_dimstiller.pdf
- if all samples of the current node have the same label c → leaf node of class c
- else: select the splitting attribute s that is "most useful" to support a decision →
make a decision node and branches for s → split the dataset according to s
- recursively repeat until a stopping criterion is reached
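The induction loop above can be sketched as a toy recursive implementation (hypothetical rows and attribute names; information gain as the "most useful" criterion; plain Python):

```python
from collections import Counter
from math import log2

# rows are dicts of attribute values plus a "label" key
def entropy(rows):
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r["label"] for r in rows).values())

def build_tree(rows, attributes):
    labels = {r["label"] for r in rows}
    if len(labels) == 1 or not attributes:          # stopping criteria
        return Counter(r["label"] for r in rows).most_common(1)[0][0]

    def gain(a):                                    # information gain of attribute a
        rem = sum(len(sub) / len(rows) * entropy(sub)
                  for v in {r[a] for r in rows}
                  for sub in [[r for r in rows if r[a] == v]])
        return entropy(rows) - rem

    best = max(attributes, key=gain)                # "most useful" split
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attributes if a != best])
                   for v in {r[best] for r in rows}}}

rows = [{"rain": "<30%", "label": "out"}, {"rain": "<30%", "label": "out"},
        {"rain": ">30%", "label": "stay"}]
print(build_tree(rows, ["rain"]))
```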
entropy [33]:
The entropy of the set of items S in the current node is given by
Entropy(S) = − ∑_{i=1}^{k} p_i · lg(p_i)
where p_i is the probability of occurrence of class i.
The entropy "measures" chaos. Uniform probability (meaning the same chance
for each class) yields maximum uncertainty and therefore maximum entropy.
So the entropy decreases as the items in a node become homogeneous.
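The stated property (uniform probabilities → maximum entropy, a pure node → zero) can be checked directly; a minimal sketch in plain Python:

```python
from math import log2

# Entropy of a class probability distribution; terms with p = 0 contribute 0.
def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: uniform → maximum for two classes
print(entropy([0.9, 0.1]))   # ≈ 0.469: skewed → lower
# a homogeneous node ([1.0]) has entropy 0
```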
information gain:
The information gain is then
IG(S, A) = Entropy(S) − ∑_{v∈dom(A)} (|S_v| / |S|) · Entropy(S_v)
where S is the set of items at the current node and S_v its subset where attribute A
has value v. We want to maximize the information gain.
gini index:
[33] https://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition
The gini index is defined as
Gini(S) = 1 − ∑_{j=1}^{k} p_j²
It describes how often a randomly chosen element from S would be incorrectly
labeled. The gini index is similar to entropy but does not have the expensive
logarithmic operation. A gini index of 0 means a pure node (perfect split).
gini gain:
The gini gain is defined similarly to the information gain, but uses the gini
index instead of entropy. We want to maximize this again.
GG(S, A) = Gini(S) − ∑_{v∈dom(A)} (|S_v| / |S|) · Gini(S_v)
One big flaw of DTs is the potential overfitting to training data. Small
changes in the input data can yield a very different DT. That is why validation
is very important.
Cross validation is the standard method for validation. The training data gets
partitioned into k distinct subsets. We use k−1 of those to train the tree and
the k-th to validate whether the tree classifies unknown data correctly. We cycle
through all k subsets and use the tree with the highest accuracy.
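The k-fold partitioning can be sketched without any ML library (index handling only; the actual training and validation calls are omitted):

```python
# k-fold cross-validation split: yields (train, test) index lists
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# every sample is used exactly once for validation across the k rounds
all_test = [i for _, test in kfold_indices(10, 5) for i in test]
print(sorted(all_test))  # 0..9, each exactly once
```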
b. Write down the formula for Effectiveness of a DT. What two criteria
does it combine? Explain them.
to zero and EDT is equal to the accuracy of the model. [34]
We can reduce the complexity by preprocessing the data before creating the
DT or by modifying the DT. A user-steered process with domain experts can also
reduce the complexity.
Preprocessing can be feature selection, outlier removal and clustering
(making 1 DT per cluster).
After the DT was constructed, we can reduce the complexity by pruning or
merging splitting values or rounding off splitting values to enhance
interpretability.
I also made a mistake while calculating it, so the split criteria for the root
and therefore for the rest of the tree is different than in the solution
provided. The general algorithm is the same though.
[34] https://www.sciencedirect.com/science/article/pii/S1071581906001078 (access via university login)
● Gini Index: GI(S) = 1 − (p_out² + p_stay²)
● Gini Gain: GG(S, A) = GI(S) − ∑_{v∈A} (|S_v| / |S|) · GI(S_v)
● Root node
|S| = 14, p_out = 6/14, p_stay = 8/14, GI(S) = 24/49
○ Rain
Rain <30%, |S| = 7, p_out = 5/7, p_stay = 2/7, GI(Rain <30%) = 20/49
Rain >30%, |S| = 7, p_out = 1/7, p_stay = 6/7, GI(Rain >30%) = 12/49
GG(S, Rain) = 24/49 - (1/2 * 20/49 + 1/2 * 12/49) = 8/49
○ Humidity
Humidity low, |S| = 4, p_out = 2/4, p_stay = 2/4, GI(low) = 1/2
Humidity medium, |S| = 7, p_out = 4/7, p_stay = 3/7, GI(medium) =
24/49
Humidity high, |S| = 3, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, humidity) = 24/49 - (4/14 * 1/2 + 1/2 * 24/49 + 0) = 5/49
○ Wind
Wind soft |S| = 8, p_out = 4/8, p_stay = 4/8, GI(soft) = 1/2
Wind strong |S| = 6, p_out = 2/6, p_stay = 4/6, GI(strong) = 4/9
GG(S, Wind) = 24/49 - (8/14 * 1/2 + 6/14 * 4/9) = 2/147
○ Temperature
<10*C |S| = 5, p_out = ⅕, p_stay = 4/5 , GI(<10C) = 8/25
>10C and <35, |S| = 7, p_out = 4/7 , p_stay = 3/7 . GI() = 24/49
>35C |S| = 2, p_out = 1/2 , p_stay = 1/2. GI(>35C) = 1/2
GG(S, Temp) = 24/49 - (5/14 * 8/25 + 1/2 * 24/49 + 2/14 * 1/2 ) =
29/490
Rain gives us the highest Gini Gain, so we will split our dataset
according to rain probability first
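The root-node gains above can be double-checked from the (out, stay) class counts per attribute value; a small verification sketch:

```python
# Gini gain from class counts; total = (out, stay) counts at the current node
def gini(out, stay):
    n = out + stay
    return 1 - (out / n) ** 2 - (stay / n) ** 2

def gini_gain(splits, total=(6, 8)):
    n = sum(total)
    return gini(*total) - sum((o + s) / n * gini(o, s) for o, s in splits)

print(round(gini_gain([(5, 2), (1, 6)]), 4))          # Rain:     8/49   ≈ 0.1633
print(round(gini_gain([(2, 2), (4, 3), (0, 3)]), 4))  # Humidity: 5/49   ≈ 0.1020
print(round(gini_gain([(4, 4), (2, 4)]), 4))          # Wind:     2/147  ≈ 0.0136
print(round(gini_gain([(1, 4), (4, 3), (1, 1)]), 4))  # Temp:     29/490 ≈ 0.0592
```

Rain indeed has the largest gain, confirming the split chosen above.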
● Rain <30%
|S| = 7, p_out = 5/7, p_stay = 2/7, GI(Rain <30%) = 20/49 (we already
calculated that)
○ Humidity
low, |S| = 4, p_out = ½, p_stay = ½, GI(low) = 1/2
medium, |S| = 3, p_out = 1, p_stay = 0, GI(medium) = 0
high, |S| = 0
GG(S, Humidity) = 20/49 - (4/7 * ½) = 6/49 = 0.122
○ Wind
soft, |S| = 4, p_out = ¾, p_stay = ¼, GI(soft) = 3/8
strong, |S| = 3, p_out = ⅔, p_stay = ⅓, GI(strong) = 4/9
GG(S, Wind) = 20/49 - (4/7 * 3/8 + 3/7 * 4/9) = 1/294
○ Temperature
<10*C |S| = 3, p_out = 1/3, p_stay = 2/3 , GI(<10C) = 4/9
>10C and <35, |S| = 3, p_out = 1 , p_stay = 0 . GI() = 0
>35C |S| = 1, p_out = 1 , p_stay = 0. GI(>35C) = 0
GG(S, Temp) = 20/49 - (3/7 * 4/9) = 32/147 ≈ 0.218
● Temperature <10*C
|S| = 3, p_out = 1/3, p_stay = 2/3 , GI(<10C) = 4/9
○ Humidity
low, |S| = 3, p_out = ⅓, p_stay = ⅔, GI(low) = 4/9
medium, |S| = 0
high, |S| = 0
GG(S, Humidity) = 0
○ Wind
soft, |S| = 2, p_out = ½, p_stay = ½, GI(soft) = 1/2
strong, |S| = 1, p_out = 0, p_stay = 1, GI() = 0
GG(S, Wind) = 4/9 - ⅔*½ = 1/9
so we can also leave this out. We need to assign a label to the
leaf node but the result is still mixed. We have no majority, so we
can just decide which label we want to use. I will use "stay
home".
And because we also have "stay home" in the other branch, we
can also leave out the split for wind and make it a leaf node.
● Wind strong
|S| = 1, p_out = 0, p_stay = 1, GI() = 0
is leaf → stay home
● Temperature >10C and <35C
|S| = 3, p_out = 1 , p_stay = 0 . GI() = 0
is leaf → go out
● Temperature >35C
|S| = 1, p_out = 1 , p_stay = 0. GI(>35C) = 0
is leaf → go out
● Rain >30%, |S| = 7, p_out = 1/7, p_stay = 6/7, GI(Rain >30%) = 12/49
○ Humidity
low, |S| = 0
medium, |S| = 4, p_out = 1/4 , p_stay = 3/4 , GI(medium) = 3/8
high, |S| = 3, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, Humidity) = 12/49 - (4/7 * ⅜) = 3/98
○ Wind
soft, |S| = 4, p_out = ¼, p_stay = ¾ , GI(soft) = 3/8
strong, |S| = 3, p_out = 0, p_stay = 1, GI(strong) = 0
GG(S, Wind) = 12/49 - (4/7 * ⅜) = 3/98
○ Temperature
<10*C |S| = 2, p_out = 0, p_stay = 1 , GI(<10C) = 0
>10C and <35, |S| = 4, p_out = 1/4 , p_stay = 3/4 . GI() = 3/8
>35C |S| = 1, p_out = 0 , p_stay = 1. GI(>35C) = 0
GG(S, Temp) = 12/49 - (4/7 * 3/8 ) = 3/98
All of them have the same Gini Gain → choose one. (I will use
Temperature).
● Temperature <10*C
|S| = 2, p_out = 0, p_stay = 1 , GI(<10C) = 0
is leaf → stay home
● Temperature >10C and <35C
|S| = 4, p_out = 1/4 , p_stay = 3/4 . GI() = ⅜
○ Humidity
low, |S| = 0
medium, |S| = 2, p_out = ½, p_stay = ½, GI(medium) = 1/2
high, |S| = 2, p_out = 0, p_stay = 1, GI(high) = 0
GG(S, Humidity) = ⅜ - (½ * ½) = 1/8
○ Wind
soft, |S| = 1, p_out = 1, p_stay = 0, GI(soft) = 0
strong, |S| = 3, p_out = 0, p_stay = 1, G(strong) = 0
GG(S, Wind) = 3/8
● Temperature >35C
|S| = 1, p_out = 0 , p_stay = 1. GI(>35C) = 0
is leaf → stay home
[35] https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[36] https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991
Advantages of a decision tree are easy interpretation and relatively easy visualization;
DTs perform well on large data sets and are fast. On the downside, they are easily
overfitted to the training data and are rarely optimal (they can vary greatly depending
on the training data).
A random forest can be much more accurate because the end result is an
aggregation of multiple low correlated decision trees. They are harder to visualize
and are slower, because they are essentially a bunch of decision trees.
Visualization
116. Next to the DT itself, what is profitable to also show in the visualization?
The DT itself only shows the result of the DT process. To really gain an
understanding of how the decision was made, it is also beneficial to show training
data and its statistics. If the real world data does not follow similar distribution as the
training data, it might explain poor decisions.
117. What are the requirements we have for a decision tree visualization?
The DT should be first displayed in its entirety for an overview and subtrees should
be selectable for a detailed view. Leaf nodes and splitting nodes should be easily
distinguishable and leaf nodes should have the class label visible. Ideally we also
can see the distribution of the splitting attribute at the splitting node or a linked,
synchronized view. At all times, the accuracy should be updated and presented.
As always, we want to perform panning, zooming and requesting details. The user
might also interact with the tree by merging, splitting or deleting decision nodes.
119. Name the basic techniques to visualize a DT. Explain 2. Which one is the
best (according to studies)?
The outline view / indentation diagram basically looks like a folder-file structure.
It can be combined with expanding/collapsing of subtrees.
A node-link diagram is the easiest to understand but is not efficient screen space
wise. Enhanced node-link diagrams can also display distributions and number of
involved items at each node.
The treemap is very efficient space wise but difficult to understand regarding the
decisions made. A variant is the tree ring, which uses a radial layout.
The icicle plot is more space-efficient than a node-link diagram, and the
hierarchy is perceived better than in a treemap. Of all basic techniques, it has the
best trade-off between screen-space efficiency and interpretability.
120. What are advanced techniques to visualize a DT? Explain one in detail.
Node-link diagrams can be improved by Bezier curves for edges, histogram of the
target variable and collapsed subtrees. The width of the edges can be used to show
how frequent a path was chosen. Details can be shown on mouse over.
Icicle plots could show info about the training data, but this does not scale well.
[37] https://img1.daumcdn.net/thumb/R800x0/?scode=mtistory2&fname=https%3A%2F%2Ft1.daumcdn.net%2Fcfile%2Ftistory%2F204DFC284AF572570A
[38] https://miro.medium.com/max/3840/1*jojTznh4HOX_8cGw_04ODA.png
[39] https://images.squarespace-cdn.com/content/v1/55b6a6dce4b089e11621d3ed/1528204277811-JX4HT3U2578DXA5CIW7O/ke17ZwdGBToddI8pDm48kPHmLxVe8SfwV-YoKPCx7JMUqsxRUqqbr1mOJYKfIPR7LoDQ9mXPOjoJoqy81S2I8N_N4V1vUb5AoIIIbLZhVYxCRW4BPu10St3TBAUQYVKct9wL8TzyCYlAdUTfmg9wVFcML89r8uCmInwS8AiaUqBgfJJEHi9xFhV3nuSB_8WT/Treemap-with-measure-name-labels.png?format=750w
[40] https://ars.els-cdn.com/content/image/1-s2.0-S1071581906001078-gr5.jpg
The user defines a hyperplane for splits based on a scatterplot matrix. The initial line
can be automatically computed with support vector machines.
confusion matrix
The confusion matrix indicates how often a class label was confused with another. It
may reveal patterns in misclassification.
We can also project misclassifications into the spatial domain to find clusters of
misclassification.
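A minimal confusion-matrix sketch in plain Python (the class labels are hypothetical example data):

```python
from collections import Counter

# Confusion matrix: rows = true class, columns = predicted class.
def confusion_matrix(y_true, y_pred, classes):
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
cm = confusion_matrix(y_true, y_pred, ["cat", "dog"])
print(cm)  # [[1, 1], [1, 2]] — off-diagonal entries are the confusions
```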
A visualization should make it easy to identify the tree topology, node relations
and leaf size. It should also be able to adjust the layout to user preferences.
Intertopic Questions
124. Explain the relation between dimension reduction and subspace clustering.
dimensionality, so we can very likely remove it from the dataset.
125. Explain the difference between global clustering, subspace clustering and
biclustering.
global clustering: all data points, all dimensions, (typically) non-overlapping clusters
subspace clustering: all data points, selected dimensions, overlapping subspaces
biclustering: selected data points, selection of 2 dimensions as feature vectors,
overlapping clusters, applied to 2D data and the results are restricted to rectangular
shapes
All of them can be performed on categorical or numerical data, but global and
subspace clustering are mostly done on numerical data, while biclustering is done on
categorical data.
126. Select one example for a problem and describe the high-level design of a
visual analytics system to tackle it.
Grid based approaches are always dependent on the grid size. It is difficult to find a
fitting global grid size. To overcome this, one can use an adaptive grid size.
Algorithms
Really cool visualization of clustering:
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/