Collection Comparison
Bachelor thesis
Credits: 18 EC
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam
Supervisor
Dr. N.J.E. van Noord
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam
Semester 1, 2022
Abstract
When navigating online museum collections, a user first needs to know
which specific collection they want to go through before they can submit
their query, which is not ideal for exploring. A solution would be merging
all the collections, but this leads to a heterogeneous collection. So in this
thesis, a data-driven approach is taken to compare museum collections and
find similarities to help combine collections by using computer vision. First,
experiments are done on a well-annotated museum dataset by extracting
high-level features from images of art pieces using a neural network and then
clustering these features using k-Means with different values for k. Two models
that are pre-trained on an image dataset are tested, namely a ResNet18
and a Vision Transformer (ViT), and both are also fine-tuned by further training
them on the well-annotated museum dataset. After quantitative and qualitative
evaluation, the pre-trained and fine-tuned ViT model with k = 10 works
best and is applied to the case study, namely the Allard-Pierson Museum
dataset. This shows that the ViT model is indeed suitable for extracting
information from visual museum data, but that the number of clusters that
represents the entire museum collection is museum specific.
Contents
1 Introduction
2 Theoretical Foundation
2.1 Neural Networks
2.1.1 Residual Neural Network
2.1.2 Vision Transformer
2.2 k-Means Clustering
2.3 Clustering Metrics
2.3.1 Davies-Bouldin Index
2.3.2 Silhouette Coefficient
3 Related Work
3.1 Summarising Visual Data
3.2 Analysing Artwork
4 Method
4.1 Extract Descriptors
4.2 Clustering
4.3 Evaluation
5 Experiments
5.1 Experimental Setup
5.1.1 Data
5.1.2 Data Retrieval
5.1.3 Data Pre-processing
5.1.4 Model Retrieval
5.2 Analysing Similarities
5.2.1 Results
5.3 Training of Models
5.3.1 Results
5.4 Performance Comparison
5.4.1 Results
7 Conclusion
1 Introduction
Archives are the longest-standing collective human effort to accumulate data.
They are carefully curated by archive scholars who follow protocols concerning
ethics, inclusivity, transparency, consent, and privacy. According to Jo
and Gebru (2020), machine learning datasets often lack the good qualities that
come with following these protocols, resulting in negative consequences such as
bias. They argue that a new specialisation needs to be established that regulates
machine learning datasets equally to how archives are regulated, so that the good
qualities of archives are ensured in these datasets as well. This would suggest
that machine learning datasets have a considerable amount to learn from archives.
However, this well-managed structure in archives might not be ideal in all cases.
Museum collections are managed by curators with specific expertise, which is
comparable to how archives are handled, and when these collections are made
available online, the rigid divisions remain. As a result, a user has to know
which specific collection they want to explore before they can submit their
query. The shortcoming of this system is highlighted by Mitchell Whitelaw, who
argues that a search query is a compromised and imperfect expression of a feeling
that arises as a “vague dissatisfaction” (Whitelaw, 2015). Thus, the current system
in which online collections are set up does not make them particularly accessible
when exploring.
A possible solution would be merging all the collections, but this would result in
a heterogeneous collection. Another solution would be a data-driven approach to
compare collections and find similarities to help combine collections. The latter
can be done using computer vision, a field in Artificial Intelligence that focuses on
the analysis and processing of visual data by computers.
A particular collection can consist of paintings, maps, and armour, which are very
different in style and/or medium but have been grouped by a curator with
a specific motivation. By using computer vision to find similarities between art
pieces, interesting new relationships can be found between works of art that may
not have been apparent at first glance. This method would allow the user to
first browse through these collections of art pieces and, after selecting a work
that piques their interest, to explore further through the more finely
curated story line made by experts.
clusters. This will be developed while answering the research question, which reads
as follows: How can museum collections be grouped in a data-driven manner?
1. What neural network works well to extract high-level features from museum
visual data?
With the aim of answering these questions, a theoretical background is given first,
followed by an overview of related work on summarising visual data and analysing
images, with a focus on art pieces. These sections form the foundation for the
process outlined in the method section. This method is then applied in
experiments on a museum dataset that comes with annotations. After the
experiments, a neural network and the clustering properties are chosen and
applied to a case study of the Allard-Pierson Museum dataset; this museum holds
the heritage collections of the University of Amsterdam. Lastly, the results are
discussed and a conclusion is drawn about how the museum collections are
grouped.
2 Theoretical Foundation
2.1 Neural Networks
For the purpose of extracting information from the visual data of the museum
by applying computer vision, a neural network is required. Two different neural
networks are going to be explored, namely a Residual Neural Network and a Vision
Transformer, since they proved to be useful in the art domain. In Section 3 a
further elaboration will follow on their use in the art domain but first they are
briefly explained in this section.
(Vaswani et al., 2017). This architecture is built on a principle called Attention
which entails comparing each token in a sequence with each other token in that
same sequence and therefore allows a better understanding of the connection be-
tween tokens. In NLP this sequence is a piece of text, e.g. a sentence, but in a ViT
this sequence is an image divided into patches. The comparison of tokens
is carried out in several heads in parallel, which is called Multi-Head Attention.
A Multi-Head Attention layer combined with a normalisation layer and feed-forward layers is called
a Transformer Encoder Block. Multiple of these Blocks can be stacked to create
a Transformer Encoder which is then followed by a Multilayer Perceptron (MLP)
head to perform classification. The full implementation of the architecture can be
seen in Figure 2.
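As a minimal sketch of the token-to-token comparison described above, the following plain-Python function computes single-head scaled dot-product self-attention over a list of patch vectors. It is an illustration only: the learned query, key and value projections of a real ViT are left out (queries, keys and values are all the raw patch vectors, an assumption made for brevity), and the function name is hypothetical.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention without learned weights.

    tokens: list of equal-length lists of floats (one vector per image patch).
    Returns one output vector per token: a weighted average of all tokens.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Compare this token with every token in the sequence (including itself).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        # Softmax turns the scores into weights that sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Mix the tokens according to the attention weights.
        outputs.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                        for d in range(dim)])
    return outputs
```

Multi-Head Attention would run several such comparisons in parallel, each on its own learned projection of the patches, and concatenate the results.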
$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \tag{1}$$
This set is found by randomly instantiating k centroids and assigning each data
point to the cluster with the closest centroid. Then, a new centroid for each cluster
is calculated and the data points are reassigned. The algorithm converges when
no data points are reassigned to a different cluster; this method does not assure
that the optimal solution will be found but can be optimised by repeating the
algorithm and averaging the results.
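The iterative procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in the thesis; the function name, the `seed` and `max_iter` parameters, and the choice to initialise the centroids with k randomly picked data points are assumptions.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm. points: list of equal-length lists of floats."""
    rng = random.Random(seed)
    # Random initialisation: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_labels = [
            min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster became empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

For example, `k_means([[0.0], [1.0], [10.0], [11.0]], 2)` groups the first two and the last two points together, with centroids at 0.5 and 10.5.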
The Davies-Bouldin Index is defined as the average, over all clusters i, of the
similarity S_ij between cluster i and the cluster j (i ≠ j) that is most similar
to it. This is represented by equation 2:
$$DBS = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} S_{ij} \tag{2}$$
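Equation 2 can be illustrated with a short plain-Python sketch. It assumes the standard Davies-Bouldin similarity S_ij = (s_i + s_j) / d_ij, where s_i is the mean distance of the points in cluster i to their centroid and d_ij is the distance between the centroids of clusters i and j. The function name is hypothetical, and at least two clusters with distinct centroids are assumed.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin Index; lower values indicate better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    ids = sorted(clusters)
    centroids = {i: [sum(col) / len(clusters[i]) for col in zip(*clusters[i])]
                 for i in ids}
    # s_i: average distance from the points of cluster i to its centroid.
    scatter = {i: sum(math.dist(p, centroids[i]) for p in clusters[i]) / len(clusters[i])
               for i in ids}
    total = 0.0
    for i in ids:
        # max over j != i of S_ij = (s_i + s_j) / d_ij
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in ids if j != i)
    return total / len(ids)
```

Two tight clusters {0, 1} and {10, 11} on a line give s = 0.5 for both and d = 10, so the index is (0.5 + 0.5) / 10 = 0.1.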
2.3.2 Silhouette Coefficient
Another metric that can be used when the underlying clusters are unknown is
the Silhouette Coefficient (Rousseeuw, 1987). This score is bounded between -1
and 1: a score near -1 indicates incorrect clustering, scores around zero signify
overlapping clusters, and a score near 1 indicates dense, well-separated clusters.
The Silhouette Coefficient for a set of n samples called N is given by equation 3:
$$SS = \frac{1}{n} \sum_{x \in N} \frac{b_x - a_x}{\max(a_x, b_x)} \tag{3}$$
In this equation, a_x represents the mean distance between sample x and all the
other samples in the same cluster, and b_x represents the mean distance between
sample x and all the samples in the second closest cluster, i.e. the nearest
cluster that x does not belong to.
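A direct plain-Python translation of equation 3 is sketched below. The function name is hypothetical, and the convention that a sample in a single-element cluster contributes 0 is an assumption not discussed in the text; the sketch also assumes at least two clusters and that not all points coincide.

```python
import math

def silhouette_score(points, labels):
    """Mean Silhouette Coefficient over all samples (equation 3)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a_x: mean distance to the other samples of the same cluster
        # (the distance of p to itself is 0, hence the division by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_x: smallest mean distance to the samples of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)
```

For the points 0, 1, 10 and 11 on a line, the labelling [0, 0, 1, 1] scores close to 0.9, while the mixed labelling [0, 1, 0, 1] scores below zero.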
3 Related Work
3.1 Summarising Visual Data
The grouping of artworks is central to this research, and grouping a large amount
of images is not a novel problem in the field of computer science. As the amount
of visual data increases, so does the need to group and summarise it in order to
better understand it.
In order to summarise and explore visual data, a possible method is using SIFT
keypoint detection to generate descriptors which hold information about scale, ori-
entation, and location (Lowe, 2004). Generated descriptors from different images
can be compared, in distance for example, to find similarities between images and
then be used to e.g. explore images in 3D or create graphs by linking images
(Heath et al., 2010; Snavely et al., 2006). An unsupervised method for finding
patterns within a large collection of images is data mining, where connections
within the data are made by inspecting which two patterns often occur together
(Rematas et al., 2015; Yuan et al., 2007). According to Rematas et al., larger
semantically significant areas within images are discovered and linked with other
images through this method, allowing a user to go through images from one pattern
to another in a semantically purposeful way. This thesis builds on the fact that
there are patterns present in images which can be linked to each other and thus
make meaningful connections. This would mean that there are also semantically
meaningful patterns in images of works of art which are to be linked by clustering
in this thesis.
Furthermore, there are approaches that do not use patterns or keypoints to find
similarities but that use metadata and/or individual pixel data from images, for
example by going through each pixel in an image, counting how often each colour
appears, and using these counts to compare images (Van Leuken et al., 2009). Images can be
provided with metadata, such as their capture location, and then be clustered on
this geographical location (Jaffe et al., 2006), and when more metadata is available,
such as short text labels and capture time, these too can be taken into account
when clustering or comparing (Kennedy and Naaman, 2008). Lastly, another way
to handle large amounts of visual data is discussed in a paper by Sinha et al.
(2011), specifically how to give a good summary of a large collection of personal
images, i.e. a photolog. A good summary has to meet three requirements: quality,
diversity, and coverage, whilst also complying with a user's information need. The
three requirements are calculated using pixel features and metadata, and the set of
images that maximises these demands is found by performing a greedy search and
chosen as summary of the photolog. This paper demonstrates that image metadata
are useful features in summarising visual data and although there is more focus on
finding patterns in this thesis, metadata will be used in order to understand the
created clusters better.
Research shows that features extracted from a neural network trained for a different
task work better than other low-level features when classifying the style of an
image (Karayev et al., 2014). This is called transfer learning, which is a technique
in which a pre-trained model of a given domain assists a task in a different domain.
Then, this model can be fine-tuned on an art dataset by retraining certain layers
of the model to further improve the model’s proficiency in the art domain (Milani
and Fraternali, 2021).
e woodblock prints (Khan and van Noord, 2021). A ResNet model (He et al., 2015)
pre-trained on the ImageNet dataset (Deng et al., 2009) was compared to a ViT
model (Dosovitskiy et al., 2020) pre-trained on the larger ImageNet-21K dataset
(Ridnik et al., 2021). Both completely and partially freezing the models was
tested, and the partially frozen ViT model proved superior. Partial
freezing allowed the model to learn some task-specific features while the first
layers of the model behaved more like a fixed feature extractor.
Lastly, transfer learning was used in instance-level recognition (ILR) for artworks
(Ypsilantis et al., 2022). A ResNet18 model, i.e. a ResNet with 18 layers,
pre-trained on ImageNet and combined with Generalized-Mean pooling proved to be
an effective representation for instance-level tasks. The performance was further
improved by fine-tuning the model on the specific art dataset that was used.
This thesis builds on the principle of distant viewing and transfer learning by using
the pre-trained ViT and ResNet18 models. As fine-tuning has proved useful in
improving a model's competence in the art domain, this will also be tested in this
thesis.
4 Method
To find a grouping method that can be generalised to a museum dataset that is not
so well annotated, experiments must first be run on a museum dataset that
has extensive metadata available. By first experimenting on an annotated dataset,
the performance of the neural networks and clustering properties can be measured
by comparing the outcomes to the available metadata. For example, the metadata
can be used to clarify whether the chosen model and clustering properties group
artworks that normally do or do not occur together.
To answer the two sub-questions, namely which neural network works well to
extract high-level features and which number of clusters gives a good representation,
three steps are carried out on the annotated museum dataset. The
first step is extracting high-level features (descriptors) from the visual data. In
the second step, the resulting descriptors are clustered using the k-Means
algorithm with different values for k. The third and final step is evaluating the
clusters quantitatively, using the Davies-Bouldin Index and Silhouette Score as
metrics, and qualitatively, using visualisations. The model and clustering
properties that perform best are taken as the answers to the sub-questions and are
applied to the data of the case study.
Figure 3: Visual representation of the steps undertaken with the input data as described
in the method. Image input is converted to descriptors using a neural network model.
The descriptors belong to underlying museum departments but are redistributed using
the k -Means algorithm.
By fine-tuning the neural network further on the museum test set, the neural
network can become more proficient in the art domain. This method does require
the dataset to have some suitable metadata available that can be used as labels
in training, such as the department the piece is originally from or the object type
visible in the piece. This results in a neural network that is not only able to detect
general visual patterns but is more specialised in detecting visual patterns in art.
4.2 Clustering
After the descriptors have been extracted, they are clustered using the k-Means
algorithm for different values of k, namely 20, 10, 6, 4, and 2. The idea is that
descriptors of images that have visual similarities are close to each other and end
up in the same cluster. The fewer clusters there are, the more general they are
expected to become; conversely, the more clusters there are, the more specific
they become.
4.3 Evaluation
The sets of clusters will be quantitatively and qualitatively evaluated by applying
clustering metrics in combination with plotting parts of the results. The quan-
titative evaluation is to be done by calculating the Davies-Bouldin Index and
Silhouette Coefficient for each of the sets of descriptors for the five different values
of k. The qualitative evaluation is to be done by reducing the dimensionality of the
descriptors to two dimensions and plotting them. The UMAP projection algorithm
is used to find an optimised low-dimensional representation that is structurally
as similar as possible to the higher-dimensional
representation of the descriptors (McInnes et al., 2018). After
making this UMAP projection, the overlap between the clusters will be observable
and the visual similarities between predefined categories in the annotated museum
dataset will be visible. Finally, the images that are closest to the centroids of the
created clusters can be displayed and help guide the decision whether the number
of clusters is representative for the entire museum collection.
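Selecting the images closest to a centroid, as used in the qualitative evaluation, can be sketched as follows. This is an illustrative plain-Python helper with a hypothetical name; it computes the centroid as the mean of the given descriptors and ranks them by Euclidean distance.

```python
import math

def closest_to_centroid(descriptors, n=4):
    """Return the indices of the n descriptors nearest to their mean."""
    # The centroid is the per-dimension mean of all descriptors in the group.
    centroid = [sum(col) / len(descriptors) for col in zip(*descriptors)]
    # Rank the descriptors by Euclidean distance to that centroid.
    order = sorted(range(len(descriptors)),
                   key=lambda i: math.dist(descriptors[i], centroid))
    return order[:n]
```

The returned indices can then be used to look up and display the corresponding images for each department or cluster.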
5 Experiments
The following sections apply the suggested method to the annotated dataset
from the Metropolitan Museum of Art (the MET) by first creating descriptors,
then performing clustering, and lastly evaluating the different outcomes. First, the
data was retrieved and pre-processed before it was converted to descriptors.
To discover which model, the ResNet18 or the ViT, creates suitable high-level
representations of the artworks, both were evaluated on their performance. This
entailed visualising the possible overlap between descriptors of different collections,
displaying the images closest to the centroids, comparing the performance of the
models after they were fine-tuned on the MET, and calculating the Davies-Bouldin
Indices and Silhouette Coefficients.
Department                                        N
Drawings and Prints                           59572
Asian Art                                     29478
Greek and Roman Art                           29005
European Sculpture and Decorative Arts        27333
Islamic Art                                   11897
Egyptian Art                                  11229
The American Wing                              9974
Costume Institute                              7593
Arms and Armor                                 6507
Medieval Art                                   6353
Ancient Near Eastern Art                       5973
Arts of Africa, Oceania and the Americas       5702
Photographs                                    5620
European Paintings                             2246
Robert Lehman Collection                       2199
Musical Instruments                            1720
The Cloisters                                  1686
Modern and Contemporary Art                     170
The Libraries                                    92
Other                                            59
Total                                        224408
Table 1: The department distribution of the artworks from the MET database.
5.1.4 Model Retrieval
The descriptors were extracted using two models: the ResNet18 and the ViT. The
TorchVision implementation of the ResNet18 used in this thesis has 17 convolutional
layers and 1 fully-connected layer, and was pre-trained on ImageNet (Deng
et al., 2009). A Generalized-Mean pooling layer was added to the model, as it
was deemed useful in instance-level recognition (ILR) (Ypsilantis et al., 2022).
The ViT-B_16 model [3] was pre-trained on the ImageNet-21K dataset (Ridnik et al.,
2021) and was loaded from Google's official checkpoint. This model works on image
patches of 16 × 16 pixels, has 12 blocks, followed by a hidden layer of size 768
and a final prediction layer, which totals 43.3 million parameters.
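For readers unfamiliar with Generalized-Mean (GeM) pooling, the sketch below shows the operation on a single channel of activations in plain Python. The exponent p = 3 is a commonly used default and an assumption here, since the value used in the thesis is not stated; the function name and the `eps` clamp are likewise illustrative.

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized-Mean pooling of one channel: (mean(x^p))^(1/p)."""
    # Clamp to a small positive value so that the power is well defined.
    clamped = [max(a, eps) for a in activations]
    return (sum(a ** p for a in clamped) / len(clamped)) ** (1.0 / p)
```

With p = 1 this reduces to average pooling, and as p grows the result approaches max pooling, so p controls how strongly the largest activations dominate the descriptor.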
5.2.1 Results
Figure 4 shows the UMAP projection of the MET dataset after the images were
pre-processed and fed through the pre-trained ResNet18 and ViT networks. For both
the ResNet18 and ViT descriptors, the departments overlapped considerably,
which means that art pieces from separate departments have visual similarities.
Likewise, the twenty k-Means clusters had some overlap, but notably less than
the departments. The k-Means clusters were more evenly distributed and more
similar in size than the department clusters.
[3] Specific implementation at https://github.com/jeonsworld/ViT-pytorch
Figure 4: UMAP projection of the MET dataset, descriptors from the pre-trained
ResNet18 and ViT. Clustering is done by departments and by k-Means with k = 20. Six
images are enlarged, where their border indicates the real department. (a) and (b) are
originally in two different departments, but both are pieces of clothing and are in the
same cluster after k-Means clustering. (c), a black and white drawing, and (d), a colour
print, are in the same department, but in different clusters. Lastly, (e) and (f) are in
different clusters, but visually very similar and originally in the same department.
For each of the departments the centroid was calculated using the ResNet18
descriptors. The four images that are closest in Euclidean distance to the
centroid of each department are shown in Figure 5. This figure shows that images
close to the centroid, and therefore also to each other, might not be visually similar
but do belong to the same department.
Figure 5: Four images that are closest to the centroid of each department, from the
ResNet18 model descriptors. Colour of border indicates which department is portrayed.
Images might not always be visually similar (e.g. see American Wing on the top left
side), but belong to the same department.
Then, the centroids for each of the twenty k-Means clusters were calculated, once
again using the ResNet18 descriptors. The four images that are closest in
Euclidean distance to the centroid of each cluster are shown in Figure 6. Most
clusters consisted of art pieces that were originally from different departments,
which can be concluded from the different border colours. However, there were
clusters that consisted of very similar images from the same department.
For example, cluster 15 consisted only of images from the Drawings and Prints
department, all of which were depictions of humans. Furthermore, cluster 13 was also
made up of images from Drawings and Prints and only depicted drawings of Native
Americans. This could suggest that there are subcategories present in
the departments and that the descriptors of these subcategories were placed close
to each other.
Figure 6: Four images that are closest to the centroid of each k-Means cluster with
k = 20, from the ResNet18 model descriptors. Colour of border indicates which department the
image originally is from. Images might not always be from the same group, but are
visually similar (e.g. see cluster 17).
After fine-tuning the models, their classification ability was measured on the test
set by calculating their accuracy, macro F1-score, macro precision and macro recall.
Model Accuracy F1-score Precision Recall
ResNet18 0.264 0.021 0.013 0.05
ViT 0.767 0.610 0.674 0.584
Table 2: Department classification performance on test set of ResNet18 and ViT after
fine-tuning on the training MET dataset.
5.3.1 Results
As seen in Table 2, the partially frozen ViT model outperformed the ResNet18
model on each metric. The ResNet18 model had an accuracy of 0.264, which is
about five times better than guessing uniformly among the twenty departments
(1/20 = 0.05), but the other metrics painted a different picture. The chosen
F1-score, precision, and recall metrics weighed each department equally, meaning
that the scores for most departments were very low.
The ViT model had an accuracy of 0.767, which is about fifteen times better than
guessing, and achieved a reasonable score on the other metrics.
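The comparison with guessing can be made concrete with a small worked example in plain Python. It computes the uniform-guessing baseline for twenty departments and, as a second reference point, the macro-averaged scores of a hypothetical classifier that always predicts the largest department. The per-department counts of the test split are not given, so the whole-dataset counts from Table 1 are used as a stand-in; this is an assumption, as is the convention that a department that is never predicted gets a precision of 0.

```python
# Department counts from Table 1 (whole MET dataset, used as a stand-in
# for the unknown test-split counts).
counts = [59572, 29478, 29005, 27333, 11897, 11229, 9974, 7593, 6507, 6353,
          5973, 5702, 5620, 2246, 2199, 1720, 1686, 170, 92, 59]
n_classes = len(counts)
chance = 1 / n_classes  # uniform random guessing over the departments

def macro_scores_majority(counts):
    """Accuracy and macro precision/recall/F1 of always predicting the largest class."""
    share = max(counts) / sum(counts)  # accuracy equals the majority share
    # Only the majority class has non-zero precision (= share) and recall (= 1);
    # all other classes contribute 0 to the macro averages.
    precision = share / len(counts)
    recall = 1.0 / len(counts)
    f1 = (2 * share / (share + 1.0)) / len(counts)
    return share, precision, recall, f1

accuracy, precision, recall, f1 = macro_scores_majority(counts)
```

Under these assumptions the majority-class baseline comes out at roughly 0.265 accuracy, 0.013 macro precision, 0.05 macro recall and 0.021 macro F1. These values are close to the ResNet18 row of Table 2, which may indicate, but does not prove, that the fine-tuned ResNet18 mostly predicted the largest department.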
The UMAP projection of the descriptors after fine-tuning on the whole MET
dataset is shown in Figure 7. The ViT model presented more distinct groups than
the ResNet18 model, which could explain the better results in classifying.
Figure 7: UMAP projection of the MET dataset, descriptors from the pre-trained and
fine-tuned ResNet18 and ViT.
5.4 Performance Comparison
In this section the descriptors extracted from each of the models, namely the
pre-trained ResNet18 and ViT with and without fine-tuning, were clustered and
compared on Davies-Bouldin Score (DBS) and Silhouette Score (SS). Each set of
descriptors was clustered using k-Means with a k of 20, 10, 6, 4, and 2. The
fine-tuned models were trained on the entire MET dataset, including the 20% test
data that was previously used to examine their classification ability. An overview
of the results is given in Table 3 and they are discussed in the next section.
5.4.1 Results
First, compared to the department groupings, both DBS and SS improved (a lower
DBS and a higher SS indicate better clustering) for every k when the descriptors
from the ResNet18 model pre-trained on ImageNet were clustered using k-Means.
The best scores were achieved when clustering with k = 2.
When the ResNet18 model was further trained on the MET, the DBS of the
department groupings improved but the SS did not. When clustering with k-Means,
the DBS worsened for k = 20, 10, and 6 in comparison to the non-fine-tuned
ResNet18 model. However, the DBS did improve for k = 4 and k = 2, and so did the
SS for k = 6 and k = 2.
The descriptors from the ViT model pre-trained on ImageNet-21K performed
similarly to those of the ResNet18 model, but often received slightly worse scores.
However, when the ViT model was further trained on the MET dataset it outperformed
most other options. The SS of the department clusters even outperformed all
other options with a score of 0.19.
The best DBS was achieved with descriptors from the ViT model trained on both
ImageNet-21K and MET, and k -Means clustering with k = 2. The best SS was
achieved with descriptors from the ViT model trained on both ImageNet-21K and
MET, and the department groupings.
So, it is apparent that the ViT model trained on both ImageNet-21K and MET
outperformed the other models, which is the reason why this model was chosen
for further testing.
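Model      Train Data            Clustering     k    DBS     SS
ResNet18   ImageNet              Departments   20   8.04  -0.01
ResNet18   ImageNet              k-Means       20   3.55   0.05
ResNet18   ImageNet              k-Means       10   3.90   0.05
ResNet18   ImageNet              k-Means        6   3.75   0.05
ResNet18   ImageNet              k-Means        4   3.72   0.07
ResNet18   ImageNet              k-Means        2   3.12   0.09
ResNet18   ImageNet + MET        Departments   20   7.89  -0.01
ResNet18   ImageNet + MET        k-Means       20   3.97   0.04
ResNet18   ImageNet + MET        k-Means       10   4.03   0.05
ResNet18   ImageNet + MET        k-Means        6   3.84   0.06
ResNet18   ImageNet + MET        k-Means        4   3.17   0.07
ResNet18   ImageNet + MET        k-Means        2   2.71   0.10
ViT        ImageNet-21K          Departments   20   8.59   0.00
ViT        ImageNet-21K          k-Means       20   3.88   0.04
ViT        ImageNet-21K          k-Means       10   4.09   0.04
ViT        ImageNet-21K          k-Means        6   4.21   0.04
ViT        ImageNet-21K          k-Means        4   4.49   0.04
ViT        ImageNet-21K          k-Means        2   4.71   0.04
ViT        ImageNet-21K + MET    Departments   20   2.52   0.19
ViT        ImageNet-21K + MET    k-Means       20   2.63   0.12
ViT        ImageNet-21K + MET    k-Means       10   2.60   0.13
ViT        ImageNet-21K + MET    k-Means        6   3.01   0.17
ViT        ImageNet-21K + MET    k-Means        4   2.76   0.16
ViT        ImageNet-21K + MET    k-Means        2   2.01   0.13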
Table 3: Clustering evaluation of the ResNet18 and ViT models with different training
data and clustering properties. DBS stands for Davies-Bouldin Score and SS stands for
Silhouette Score. Both scores are rounded to two decimals.
The pre-trained and fine-tuned ViT model with k = 2 performed best when the two
scores are weighted equally. However, Figure 8 makes clear that the two resulting
clusters mainly separated the (black and white) Drawings and Prints from the
other departments. This option was too global and did not properly show the
diversity of the collection, which is why the second best performing option was
considered instead: the fine-tuned ViT model with k = 10. Figure 9 illustrates
the ten clusters that were created.
These ten clusters showed more similarity within each cluster and were quite
different from each other. Therefore, the number of clusters chosen was ten, and
k-Means was applied with k = 10.
Figure 8: Fifteen images that are closest to each of the two centroids of k -Means clusters
with k = 2, from the fine-tuned ViT model descriptors.
Figure 9: Four images that are closest to each of the ten centroids of k -Means clusters
with k = 10, from the fine-tuned ViT model descriptors.
6 Case Study on Allard-Pierson
Experiments on the MET dataset showed that the pre-trained and fine-tuned ViT
neural network worked well for extracting high-level features from visual data
from a museum and that ten clusters gave a good representation. These properties
formed the foundation for the case study on data from the Allard-Pierson Museum.
This entailed passing the data through the model to extract descriptors, clustering
this data using k -Means with k = 10 and evaluating the outcome by showing the
UMAP projection and images from each cluster.
6.1 Data
The online Allard-Pierson Museum (the AP) dataset consists of 37626 unique
images of objects, where some images have been provided with metadata. It is
not clear from the dataset to which collection a particular artwork belongs, but
according to the museum there are fourteen collections curated by experts [4].
The data from the AP was retrieved from two different sources: the Beeldbank [5],
which consists of all objects with a digital image, and the TIN [6], which is
made up of the theater collection pieces of which an image exists. A SPARQL
query was used to collect the data from the Beeldbank and retrieved 188733
entries, which reduced to 37300 unique image objects. Some entries in this data
are provided with metadata such as title, description, abstract, etc., but all
objects have a link to the image, which was used to download the image. During
the downloading process three faulty image links were found, which could not be
downloaded, resulting in 37297 unique images.
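The reduction from query results to unique image objects can be sketched as a simple de-duplication step. How duplicates were identified is not stated in the text; the sketch below assumes they are detected by identical image links, and the field name `image_url` as well as the function name are hypothetical.

```python
def unique_by_image(entries, key="image_url"):
    """Keep the first entry for every distinct image link (assumed criterion)."""
    seen = set()
    unique = []
    for entry in entries:
        url = entry.get(key)
        # Skip entries without a link and entries whose link was already seen.
        if url and url not in seen:
            seen.add(url)
            unique.append(entry)
    return unique
```

Keeping the first entry per link preserves the original order of the query results, which makes the step reproducible.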
The TIN dataset is stored as an XML file where each entry contains some metadata
and a filename, which was used to retrieve the image from an online server. The
Beeldbank and TIN datasets were merged into one JSON file. The whole dataset
was then pre-processed in the same manner as the MET dataset was pre-processed
(see Section 5.1.3).
[4] See https://allardpierson.nl/en/collecties/ for more information about the collections.
[5] See https://lod.uba.uva.nl/
[6] From https://servicetin.adlibhosting.com/te4/wwwopac.ashx?command=search&database=collectCUE&search=pointer%20108&output=xml&limit=400&startfrom=1&xmltype=grouped
6.2 Results
The UMAP projection of the descriptors is shown in Figure 10. There were distinct
groups visible in the unclustered graph, which ideally would have been captured
by the clustering algorithm. This was partly reflected in the clustered graph, e.g.
cluster 7 was a concentrated group of data points. However, this desired behaviour
did not occur in the concentrated area located in the left center of the plot.
This area was divided into three clusters, namely clusters 1, 6 and 8, while they
seemingly should have belonged to the same cluster since their descriptors were
similar to each other. The division of this large area was probably caused by the
number of clusters; since ten clusters had to be created, the algorithm converged
to this partition. The AP collection is perhaps less diverse than the MET collection,
so that fewer than ten distinct visual groups were present. This could
be an argument against creating ten clusters for this museum collection and
suggests that the number of clusters might be museum specific.
Figure 10: UMAP projection of the AP dataset, descriptors from the pre-trained and
fine-tuned ViT. Clustering is done using k -Means with k = 10.
To gain more insight into these clusters, Figure 11 shows three types of images
for each of the ten clusters. Each cluster has a centroid; descriptors of images
close to this centroid are centrally located in the cluster, while descriptors
far away from the centroid lie on the edge of the cluster. So, to represent the
entire cluster, Figure 11 shows three images of museum objects closest to the
centroid, three furthest away, and three in between. Ideally, all images of the
same cluster should have visual similarities: if the furthest images and the
closest images from the same cluster resemble each other, then the cluster is
visually very similar. If they do not resemble each other, the furthest images
might be outliers that do not belong to that cluster.
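The selection of closest, in-between and furthest images can be sketched as
follows. This is a minimal illustration, not the code used for Figure 11: the
function name `representatives` is hypothetical, and taking the "in between"
images from the middle of the distance ranking is an assumption, since the exact
rule is not specified here.

```python
import math


def representatives(descriptors, centroid, n=3):
    """Return index lists of the n closest, n middle and n furthest descriptors.

    descriptors: feature vectors of all images assigned to one cluster.
    centroid:    the k-Means centroid of that cluster.
    """
    # Rank the images of the cluster by Euclidean distance to the centroid.
    order = sorted(range(len(descriptors)),
                   key=lambda i: math.dist(descriptors[i], centroid))
    # Assumption: "in between" means the n images around the median rank.
    mid = (len(order) - n) // 2
    return order[:n], order[mid:mid + n], order[-n:]


# Toy 1-D descriptors (hypothetical) with the centroid at the origin.
toy = [(5.0,), (0.0,), (8.0,), (3.0,), (1.0,), (7.0,), (2.0,), (6.0,), (4.0,)]
closest, middle, furthest = representatives(toy, (0.0,))
print(closest, middle, furthest)
```

If a cluster holds fewer than 3n images, the three groups returned by this
sketch overlap.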
For each artwork, metadata was collected where available, grouped per cluster,
and cleaned up as much as possible, for example by removing stop words. For each
cluster, the ten most relevant words are shown along with the images. The
relevance of the words was calculated using term frequency–inverse document
frequency (tf-idf). This value increases if a word appears often in a certain
cluster and not in other clusters; a word with a high tf-idf score is therefore
relevant for that cluster.
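As an illustration of this ranking, the sketch below treats the pooled metadata
of each cluster as one document and scores every word with a basic tf-idf. The
exact tf-idf variant used in this thesis (smoothing, normalisation) is not
stated here, so the formula, the function name `top_words` and the toy tokens
are assumptions.

```python
import math
from collections import Counter


def top_words(cluster_tokens, n=10):
    """Rank the words of each cluster by tf-idf.

    cluster_tokens: dict mapping a cluster id to its list of cleaned tokens.
    Returns a dict mapping each cluster id to its n highest-scoring words.
    """
    n_clusters = len(cluster_tokens)
    # Document frequency: in how many clusters does each word occur?
    df = Counter()
    for tokens in cluster_tokens.values():
        df.update(set(tokens))
    result = {}
    for cluster, tokens in cluster_tokens.items():
        counts = Counter(tokens)
        # tf = relative frequency in the cluster, idf = log(N / df).
        scores = {word: (count / len(tokens)) * math.log(n_clusters / df[word])
                  for word, count in counts.items()}
        # Sort by descending score; ties are broken alphabetically.
        result[cluster] = sorted(scores, key=lambda w: (-scores[w], w))[:n]
    return result


# Toy metadata (hypothetical tokens, stop words already removed).
toy_clusters = {
    1: ["letter", "pekidim", "letter", "archive"],
    2: ["book", "auction", "book", "archive"],
    3: ["portret", "klei", "archive"],
}
print(top_words(toy_clusters, n=2))
```

A word such as "archive", which occurs in every toy cluster, receives an idf of
zero and therefore never ranks above a word that is specific to one cluster.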
Figure 11: Nine images per cluster: those closest to, in between, and furthest from the
centroid of the k-Means cluster, with k = 10. The words below each cluster are the ten
most relevant words of that cluster.
The images of the AP mainly consisted of letters and books, which is why seven
clusters consisted of this type. However, differences were visible between these
clusters: clusters 1, 6 and 8 consisted mostly of letters, while clusters 2, 4,
5 and 9 consisted mostly of books. The similarity between clusters 1, 6 and 8 is
confirmed in Figure 10, as they split up a concentrated area located in the left
center of the plot. The words “pekidim” and “amarkalim” also occurred in all of
these clusters; they refer to the Pekidim & Amarkalim archive of incoming
letters, mainly from Jerusalem. However, the images also suggest that cluster 1
consisted more of loosely written shorter letters, whereas clusters 6 and 8 had
more structured longer letters, which could be an argument to separate them.
Clusters 2 and 4 were positioned quite close to each other in the UMAP projection,
but reasonably formed separate clusters, as cluster 2 primarily had pages with
text on them while cluster 4 had blank pages. Cluster 4 also had some descriptors
located further away from the main cluster, which explains the non-book images.
Furthermore, clusters 5 and 9 mainly consisted of books, but were clearly
distinguished from the other clusters and from each other by their color.
Additionally, words associated with auctions and catalogs occurred in clusters 2
and 4, and the word “omslagtitel” (cover title) appeared in cluster 5; these were
all visually reflected in some of the images.
The remaining three clusters were clusters 3, 7 and 10, and some of their relevant
words could also be linked back to the images; the words “portret”, “klei” and
“botanie” (which translate to portrait, clay and botany respectively) were reflected
in most of the images of their corresponding cluster. However, the clusters also
contained images whose membership is less obvious to a viewer; for example, there
were woven materials in cluster 10, while the other images seemed to indicate that
this cluster mainly consisted of botanical works of art. Such findings show that
interesting new connections are made between works of art that do not normally
appear together, but are linked because of the similarity in their high-level
features. However, they also show how difficult it is to cluster properly, because
cluster 10 should perhaps have been split further.
7 Conclusion
To answer the research question, namely How can museum collections be grouped
in a data-driven manner?, this thesis used the annotated Metropolitan Museum
of Art dataset to test the performance of two neural networks, ResNet18 and ViT,
and the k-Means algorithm for different values of k. Applying transfer learning in
the art domain proved useful, as it was possible to detect visual patterns in
the images. This became apparent in the qualitative evaluation of the pre-trained
ResNet18: images with similar patterns were arranged in the same cluster.
This thesis further shows that the prediction accuracy of a partially frozen ViT
is approximately three times higher than that of the ResNet18, which means
that the ViT is more proficient in distinguishing the MET museum departments. A
possible explanation for this behaviour is that the ViT was pre-trained on a larger
dataset and therefore had an advantage over the ResNet, which was pre-trained on
a smaller dataset.
The pre-trained and fine-tuned ViT also proved superior when the different
clustering metrics were examined. After a qualitative analysis of the clusters,
this model combined with ten clusters was found to give the best conditions for
representing the entire museum collection.
The case study on the less annotated data from the Allard-Pierson Museum showed
that when the data is not as diverse as the experimental dataset, multiple
clusters may emerge that resemble each other. There might be slight differences
between these similar clusters, so that new subgroups are created, but the number
of clusters that best represents the entire museum collection is museum specific.
The number of clusters should therefore be optimised for each dataset. Lastly,
some of the created clusters are less obvious, so novel collections emerge in
which objects that do not naturally occur together are grouped.
References
Arnold, T., Leonard, P., and Tilton, L. (2017). Knowledge creation
through recommender systems. Digital Scholarship in the Humanities,
32(Supplement_2):ii151–ii157.
Arnold, T. and Tilton, L. (2019). Distant viewing: Analyzing large visual corpora.
Digital Scholarship in the Humanities, 34(Supplement_1):i3–i16.
Bradley, P., Mangasarian, O., and Street, W. (1996). Clustering via concave
minimization. Advances in neural information processing systems, 9.
Cetinic, E. and She, J. (2021). Understanding and creating art with AI: Review
and outlook.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009).
ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on
Computer Vision and Pattern Recognition, pages 248–255.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image
recognition at scale.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image
recognition.
Heath, K., Gelfand, N., Ovsjanikov, M., Aanjaneya, M., and Guibas, L. J. (2010).
Image webs: Computing and exploiting connectivity in image collections. In
2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 3432–3439. IEEE.
Jaffe, A., Naaman, M., Tassa, T., and Davis, M. (2006). Generating summaries
for large collections of geo-referenced photographs. In Proceedings of the 15th
international conference on World Wide Web, pages 853–854.
Jo, E. S. and Gebru, T. (2020). Lessons from archives: Strategies for collecting
sociocultural data in machine learning. In Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency, FAT* ’20, pages 306–316, New
York, NY, USA. Association for Computing Machinery.
Karayev, S., Trentacoste, M., Han, H., Agarwala, A., Darrell, T., Hertzmann, A.,
and Winnemoeller, H. (2014). Recognizing image style. In Proceedings of the
British Machine Vision Conference. BMVA Press.
Khan, S. and van Noord, N. (2021). Stylistic multi-task analysis of ukiyo-e
woodblock prints.
MacQueen, J. et al. (1967). Some methods for classification and analysis of
multivariate observations. In Proceedings of the fifth Berkeley symposium on
mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.
McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform manifold
approximation and projection for dimension reduction.
Rematas, K., Fernando, B., Dellaert, F., and Tuytelaars, T. (2015). Dataset
fingerprints: Exploring image collections through data mining. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4867–4875.
Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor, L. (2021). Imagenet-21k
pretraining for the masses.
Sinha, P., Mehrotra, S., and Jain, R. (2011). Summarization of personal photologs
using multidimensional content and context. In Proceedings of the 1st ACM
International Conference on Multimedia Retrieval, pages 1–8.
Snavely, N., Seitz, S. M., and Szeliski, R. (2006). Photo tourism: exploring photo
collections in 3d. In ACM siggraph 2006 papers, pages 835–846.
Van Leuken, R. H., Garcia, L., Olivares, X., and van Zwol, R. (2009). Visual
diversification of image search results. In Proceedings of the 18th international
conference on World wide web, pages 341–350.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, L., and Polosukhin, I. (2017). Attention is all you need.
Ypsilantis, N.-A., Garcia, N., Han, G., Ibrahimi, S., Van Noord, N., and Tolias,
G. (2022). The Met dataset: Instance-level recognition for artworks.
Yuan, J., Wu, Y., and Yang, M. (2007). Discovery of collocation patterns: from
visual words to visual phrases. In 2007 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–8. IEEE.