
Computer Vision for Museum

Collection Comparison

Noa I.J. Nonkes


Layout: typeset by the author using LaTeX.
Cover illustration: Photo by Chip Clark, Smithsonian Institution
Computer Vision for Museum
Collection Comparison
A Data-driven Analysis

Noa I.J. Nonkes


12705578

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. N.J.E. van Noord
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

Semester 1, 2022
Abstract
When navigating online museum collections, a user first needs to know which specific collection they want to go through before they can submit their query, which is not ideal for exploration. A possible solution would be merging all the collections, but this leads to a heterogeneous collection. In this thesis, a data-driven approach using computer vision is therefore taken to compare museum collections and find similarities that can help combine them. First, experiments are done on a well-annotated museum dataset by extracting high-level features from images of art pieces using a neural network and then clustering these features using k-Means with different values for k. Two models that are pre-trained on an image dataset are tested, namely a ResNet18 and a Vision Transformer (ViT); both are also fine-tuned by further training them on the well-annotated museum dataset. After quantitative and qualitative evaluation, the pre-trained and fine-tuned ViT model with k = 10 works best and is applied to the case study, namely the Allard-Pierson Museum dataset. This shows that the ViT model is indeed suitable for extracting information from visual museum data, but that the number of clusters that represents the entire museum collection is museum specific.
Contents

1 Introduction

2 Theoretical Foundation
2.1 Neural Networks
2.1.1 Residual Neural Network
2.1.2 Vision Transformer
2.2 k-Means Clustering
2.3 Clustering Metrics
2.3.1 Davies-Bouldin Index
2.3.2 Silhouette Coefficient

3 Related Work
3.1 Summarising Visual Data
3.2 Analysing Artwork

4 Method
4.1 Extract Descriptors
4.2 Clustering
4.3 Evaluation

5 Experiments
5.1 Experimental Setup
5.1.1 Data
5.1.2 Data Retrieval
5.1.3 Data Pre-processing
5.1.4 Model Retrieval
5.2 Analysing Similarities
5.2.1 Results
5.3 Training of Models
5.3.1 Results
5.4 Performance Comparison
5.4.1 Results

6 Case Study on Allard-Pierson
6.1 Data
6.2 Results

7 Conclusion

1 Introduction
Archives are the longest-standing collective human effort to accumulate data. They are carefully curated by archive scholars who follow protocols concerning ethics, inclusivity, transparency, consent, and privacy. According to Jo and Gebru (2020), machine learning datasets often lack the good qualities that come with following these protocols, resulting in negative consequences such as bias. They argue that a new specialisation needs to be established that regulates machine learning datasets in the same way archives are regulated, so that the good qualities of archives are ensured in these datasets as well. This would suggest that machine learning datasets have a considerable amount to learn from archives. However, this well-managed structure in archives might not be ideal in all cases.

Museum collections are managed by curators with specific expertise, in a manner comparable to how archives are handled, and when these collections are made available online, the rigid divisions remain. This results in a user having to know which specific collection they want to explore before they can submit their query. The shortcoming of this system is highlighted by Mitchell Whitelaw, who argues that a search query is a compromised and imperfect expression of a feeling that arises as a "vague dissatisfaction" (Whitelaw, 2015). Thus, the current system in which online collections are set up does not make them particularly accessible for exploration.

A possible solution would be merging all the collections, but this would result in
a heterogeneous collection. Another solution would be a data-driven approach to
compare collections and find similarities to help combine collections. The latter
can be done using computer vision, a field in Artificial Intelligence that focuses on
the analysis and processing of visual data by computers.

A particular collection can consist of paintings, maps, and armour which are very different in style and/or medium, but which have been grouped by a curator with a specific motivation. By using computer vision to find similarities between art pieces, interesting new relationships can be found between works of art that may not have been apparent at first glance. This method would allow the user to first browse through these collections of art pieces and, when selecting a work that piques their interest, to further explore the more finely curated story line made by experts.

To possibly improve the user-friendliness of navigating online museum collections, the research in this thesis will use the visual data of museum collections, i.e. images of the art pieces, to conduct a data-driven analysis and create new collection
clusters. This will be developed while answering the research question, which reads
as follows: How can museum collections be grouped in a data-driven manner?

In order to answer this research question, it is divided into two sub-questions:

1. What neural network works well to extract high-level features from museum
visual data?

2. Which number of clusters gives a good representation of the entire museum collection?

With the aim of answering these questions, a theoretical background will be given, followed by an overview of related work on summarising visual data and analysing images with a focus on art pieces. These sections will form the foundation for the process outlined in the methodology. Then, this method will be applied in order to experiment on a museum dataset that is accompanied by annotations. After the experiments are done, a neural network will be chosen and the clustering properties will be decided upon, which will then be applied to a case study of the Allard-Pierson Museum dataset, a museum that houses the heritage collections of the University of Amsterdam. Lastly, the results will be discussed and a conclusion about how the museum collections are grouped will be drawn.

2 Theoretical Foundation
2.1 Neural Networks
For the purpose of extracting information from the visual data of the museum by applying computer vision, a neural network is required. Two different neural networks will be explored, namely a Residual Neural Network and a Vision Transformer, since both have proved useful in the art domain. In Section 3 a further elaboration will follow on their use in the art domain, but first they are briefly explained in this section.

2.1.1 Residual Neural Network


A Residual Neural Network (ResNet) is an architecture that has integrated identity mappings, known as skip connections and shown in Figure 1, which are designed to allow for deeper neural networks (He et al., 2015). Ordinary artificial neural networks have a limit concerning the number of layers they can possess before their performance starts deteriorating. This phenomenon can be mitigated by the aforementioned skip connections, which add the output of a previous layer to the input of a later layer and therefore help counter the vanishing gradient problem. The vanishing gradient problem is caused by backpropagation, where the gradients that are used to update the weights in the network become very small, which prevents the weights from updating and the network from learning. Thus, the skip connections inhibit the gradient from vanishing, which results in the capability to train deeper neural networks.

Figure 1: A residual learning block (He et al., 2015)
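
To make the skip connection concrete, the following is a minimal PyTorch sketch of a basic residual block, assuming equal input and output channel counts so that no projection is needed on the shortcut; it is illustrative only and omits details of the actual ResNet18 implementation.

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        # Basic residual block: two 3x3 convolutions whose output F(x) is added to
        # the block's input x via the skip connection, i.e. the block returns F(x) + x.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            # Skip connection: the gradient can flow through the identity path,
            # which counters the vanishing gradient problem.
            return self.relu(out + x)

    block = ResidualBlock(64)
    output = block(torch.rand(1, 64, 56, 56))  # output has the same shape as the input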

2.1.2 Vision Transformer


A Vision Transformer (ViT) (Dosovitskiy et al., 2020) is based on a neural network architecture used in natural language processing (NLP) called a Transformer (Vaswani et al., 2017). This architecture is built on a principle called Attention, which entails comparing each token in a sequence with each other token in that same sequence and therefore allows a better understanding of the connections between tokens. In NLP this sequence is a piece of text, e.g. a sentence, but in a ViT this sequence is an image divided into patches. The operation of comparing tokens is done in multiple heads, called Multi-Head Attention. The Multi-Head Attention layer, in combination with a normalisation layer and feed-forward layers, is called a Transformer Encoder Block. Multiple of these blocks can be stacked to create a Transformer Encoder, which is then followed by a Multilayer Perceptron (MLP) head to perform classification. The full implementation of the architecture can be seen in Figure 2.

Figure 2: Visual representation of the architecture of a Vision Transformer (Dosovitskiy et al., 2020)
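
The following PyTorch sketch illustrates this pipeline conceptually: patch embedding, a class token with positional embeddings, a stack of Transformer encoder blocks, and an MLP head. It is only a simplified illustration and differs in details (e.g. normalisation placement and initialisation) from the actual ViT-B/16 implementation used later in this thesis.

    import torch
    from torch import nn

    class TinyViT(nn.Module):
        def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # Patch embedding: a strided convolution turns every 16x16 patch into one token vector.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dim_feedforward=4 * dim, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # stacked encoder blocks
            self.mlp_head = nn.Linear(dim, num_classes)

        def forward(self, images):                    # images: (batch, 3, 224, 224)
            x = self.patch_embed(images)              # (batch, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)          # (batch, 196, dim): the sequence of patch tokens
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed
            x = self.encoder(x)                       # Multi-Head Attention + feed-forward blocks
            return self.mlp_head(x[:, 0])             # classify from the class token

    logits = TinyViT(depth=2)(torch.rand(2, 3, 224, 224))  # shallow instance for illustration; shape (2, 1000)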

2.2 k-Means Clustering


Clustering can be applied to arrange a set of data points in an unsupervised manner, the data points being the museum objects in this thesis. A commonly used clustering technique is k-Means clustering, where n data points are grouped into k clusters (MacQueen et al., 1967). The intention is to find a set C consisting of k clusters, i.e. C = {C_1, . . . , C_k}, which minimises the total within-cluster sum of squares. The sum of squares of a cluster is computed by summing the squared Euclidean distances between the mean of the cluster µ_i, its centroid, and each data point in that cluster. This is represented by equation 1:

\arg\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2    (1)

This set is found by randomly instantiating k centroids and assigning each data point to the cluster with the closest centroid. Then, a new centroid for each cluster is calculated and the data points are reassigned. The algorithm converges when no data points are reassigned to a different cluster. This method does not assure that the optimal solution will be found, but it can be improved by repeating the algorithm with different random initialisations and keeping the best result.
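
As an illustration of the procedure described above, the following minimal NumPy sketch implements these assignment and update steps; it is written for clarity rather than efficiency, and in practice a library implementation would be used.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # X: (n, d) array of data points; returns the cluster labels and centroids.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = np.full(len(X), -1)
        for _ in range(n_iter):
            # Assignment step: each point joins the cluster with the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # converged: no data point was reassigned to a different cluster
            labels = new_labels
            # Update step: recompute each centroid as the mean of its assigned points.
            for i in range(k):
                if np.any(labels == i):
                    centroids[i] = X[labels == i].mean(axis=0)
        return labels, centroids

    labels, centroids = kmeans(np.random.rand(200, 2), k=4)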

2.3 Clustering Metrics


After creating the clusters using k-Means, they can be evaluated using clustering metrics. Since the created museum clusters are not trying to replicate predefined categories, the clustering metrics need to be able to evaluate the clusters without ground-truth labels being known. Hence, in this section two clustering metrics are introduced that do not require labels.

2.3.1 Davies-Bouldin Index


When the underlying clusters are unknown, the Davies-Bouldin Index is a metric that can be used to evaluate the created clusters (Davies and Bouldin, 1979). The index represents how similar the clusters are, with a lower index indicating that the clusters are less similar; zero is the lowest possible value. A lower score is desirable because it means that the clusters are distinct from each other.

Assume there is a set of k clusters: C = {C_1, . . . , C_k}. The similarity between cluster C_i and cluster C_j is defined as S_ij, a trade-off between the cluster diameters s_i and s_j of clusters C_i and C_j, and the distance d_ij between the centroids of C_i and C_j. The similarity is then calculated with the following formula:

S_{ij} = \frac{s_i + s_j}{d_{ij}}

The Davies-Bouldin Index is defined as the average, over all clusters i, of the similarity S_ij to the most similar other cluster j, with i ≠ j. This is represented by equation 2:

DBS = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} S_{ij}    (2)

2.3.2 Silhouette Coefficient
Another metric that can be used when the underlying clusters are unknown is the Silhouette Coefficient (Rousseeuw, 1987). This score is bounded between -1 and 1, where a low score indicates incorrect clustering, scores around zero signify overlapping clusters, and a positive score indicates dense, well-separated clusters. The Silhouette Coefficient for a set N of n samples is given by equation 3:

SS = \frac{1}{n} \sum_{x \in N} \frac{b_x - a_x}{\max(a_x, b_x)}    (3)

In this equation, a_x represents the mean distance between sample x and all the other samples in the same cluster, and b_x represents the mean distance between sample x and all the samples in the nearest cluster of which x is not a member.
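
Both metrics are available in scikit-learn; the small sketch below uses synthetic data, so it only illustrates the expected direction of the scores rather than any result from this thesis.

    from sklearn.datasets import make_blobs
    from sklearn.metrics import davies_bouldin_score, silhouette_score

    # Three well-separated synthetic clusters: the DBS should be low and the SS high.
    X, labels = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
    print(davies_bouldin_score(X, labels))  # lower is better, zero is the minimum
    print(silhouette_score(X, labels))      # between -1 and 1, higher is better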

3 Related Work
3.1 Summarising Visual Data
The grouping of artworks is central to this research, and grouping a large number of images is not a novel problem in the field of computer science. As the amount of visual data increases, so does the need to group and summarise it in order to better understand it.

In order to summarise and explore visual data, a possible method is using SIFT keypoint detection to generate descriptors which hold information about scale, orientation, and location (Lowe, 2004). Descriptors generated from different images can be compared, for example by their distance, to find similarities between images and then be used to, e.g., explore images in 3D or create graphs by linking images (Heath et al., 2010; Snavely et al., 2006). An unsupervised method for finding patterns within a large collection of images is data mining, where connections within the data are made by inspecting which two patterns often occur together (Rematas et al., 2015; Yuan et al., 2007). According to Rematas et al., larger semantically significant areas within images are discovered and linked with other images through this method, allowing a user to move through images from one pattern to another in a semantically purposeful way. This thesis builds on the fact that there are patterns present in images which can be linked to each other to make meaningful connections. This would mean that there are also semantically meaningful patterns in images of works of art, which are to be linked by clustering in this thesis.

Furthermore, there are approaches that do not use patterns or keypoints to find similarities but that use metadata and/or individual pixel data from images, such as going through each pixel in an image, keeping count of how often a colour appears, and using this pixel data to compare images (Van Leuken et al., 2009). Images can be provided with metadata, such as their capture location, and then be clustered on this geographical location (Jaffe et al., 2006), and when more metadata is available, such as short text labels and capture time, these too can be taken into account when clustering or comparing (Kennedy and Naaman, 2008). Lastly, another way to handle large amounts of visual data is discussed in a paper by Sinha et al. (2011), specifically how to give a good summary of a large collection of personal images, i.e. a photolog. A good summary has to meet three requirements: quality, diversity, and coverage, whilst also complying with a user's information need. The three requirements are calculated using pixel features and metadata, and the set of images that maximises these demands is found by performing a greedy search and is chosen as the summary of the photolog. This paper demonstrates that image metadata are useful features in summarising visual data; although the focus in this thesis is more on finding patterns, metadata will be used in order to understand the created clusters better.

3.2 Analysing Artwork


In order to group art pieces, they must first be analysed to extract meaningful
features. The field of digital humanities has already indicated that there is a need
for a framework to execute visual data analysis on a larger scale (Arnold et al.,
2017). The distant viewing method proved to be useful: a method in which new high-level semantic metadata can be extracted from the images by actually "viewing" the images using computer vision (Arnold and Tilton, 2019). This distant viewing
framework can be applied to extract features which then can be used to automat-
ically recognise the content or style of works of art. A possible application would
be a smart retrieval system that makes analysing larger art collections easier and
more efficient (Cetinic and She, 2021).

Research shows that features extracted from a neural network trained for a different
task work better than other low-level features when classifying the style of an
image (Karayev et al., 2014). This is called transfer learning, which is a technique
in which a pre-trained model of a given domain assists a task in a different domain.
Then, this model can be fine-tuned on an art dataset by retraining certain layers
of the model to further improve the model’s proficiency in the art domain (Milani
and Fraternali, 2021).

In addition, transfer learning seemed to be effective for style classification in Ukiyo-e woodblock prints (Khan and van Noord, 2021). A ResNet model (He et al., 2015) pre-trained on the ImageNet dataset (Deng et al., 2009) was compared to a ViT model (Dosovitskiy et al., 2020) pre-trained on the larger ImageNet-21K dataset (Ridnik et al., 2021), and experiments were conducted with completely and partially freezing the models; the partially frozen ViT model was superior. Partially freezing allowed the model to learn some task-specific features while making the first layers of the model behave somewhat as a feature extractor.

Lastly, transfer learning was used in instance-level recognition for artworks (Ypsilantis et al., 2022). The ResNet18 model pre-trained on ImageNet, a ResNet with 18 layers, combined with Generalized-Mean pooling proved to be an effective representation for instance-level tasks. The performance was further improved by fine-tuning the model on the specific art dataset that was used.

This thesis builds on the principles of distant viewing and transfer learning by using the pre-trained ViT and ResNet18 models. As fine-tuning was deemed useful for improving a model's competence in the art domain, this will also be tested in this thesis.

4 Method
To find a grouping method that can be generalised to a museum dataset that is not as well annotated, experiments must first be carried out on a museum dataset that has extensive metadata available. By first experimenting on an annotated dataset, the performance of the neural networks and clustering properties can be measured by comparing the outcomes to the available metadata. For example, the metadata can be used to clarify whether the chosen model and clustering properties group artworks that normally do or do not occur together.

To answer the two sub-questions, namely what neural network works well to extract high-level features and which number of clusters gives a good representation, three steps need to be undertaken on the annotated museum dataset. The first step is extracting high-level features (descriptors) from the visual data; in the second step, the resulting descriptors are clustered using the k-Means algorithm with different values for k. The third and final step is evaluating the clusters using the Davies-Bouldin Index and Silhouette Score as quantitative metrics and qualitative means such as visualisations. The model and clustering properties that perform most desirably are considered as answers to the sub-questions and are applied to the data of the case study.

Figure 3: Visual representation of the steps undertaken with the input data as described
in the method. Image input is converted to descriptors using a neural network model.
The descriptors belong to underlying museum departments but are redistributed using
the k -Means algorithm.

4.1 Extract Descriptors


To extract descriptors from images of museum objects, a neural network that is pre-trained on an image dataset is to be used, which is able to detect visual patterns and can possibly unveil useful high-level information. By removing the prediction layer from a neural network, it essentially behaves as a high-level feature extractor, and the resulting vectors are the descriptors mentioned before. Applying transfer learning, by using a pre-trained neural network, will result in descriptors of images with similar visual patterns having a smaller distance between them than images with very different visual patterns.

By fine-tuning the neural network further on the annotated museum dataset, the neural network can become more proficient in the art domain. This method does require
the dataset to have some suitable metadata available that can be used as labels
in training, such as the department the piece is originally from or the object type
visible in the piece. This results in a neural network that is not only able to detect
general visual patterns but is more specialised in detecting visual patterns in art.

4.2 Clustering
After having extracted the descriptors, they are to be clustered using the k-Means algorithm for different values of k, namely: 20, 10, 6, 4, and 2. The idea is that descriptors of images that have visual similarities are close to each other and end up in the same cluster. The fewer clusters, the more general the clusters are expected to become. Naturally, this also works the other way around: the more clusters, the more specific they become.

4.3 Evaluation
The sets of clusters will be quantitatively and qualitatively evaluated by applying
clustering metrics in combination with plotting parts of the results. The quan-
titative evaluation is to be done by calculating the Davies-Bouldin Index and
Silhouette Coefficient for each of the sets of descriptors for the five different values
of k. The qualitative evaluation is to be done by reducing the dimensionality of the descriptors to two dimensions and plotting them. The UMAP projection algorithm is used to find an optimised low-dimensional representation that is structurally as similar as possible to the higher-dimensional representation of the descriptors (McInnes et al., 2018). After
making this UMAP projection, the overlap between the clusters will be observable
and the visual similarities between predefined categories in the annotated museum
dataset will be visible. Finally, the images that are closest to the centroids of the
created clusters can be displayed and help guide the decision whether the number
of clusters is representative for the entire museum collection.
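
A sketch of this qualitative step with the umap-learn package could look as follows; the random arrays stand in for the real descriptors and cluster assignments.

    import numpy as np
    import umap  # from the umap-learn package
    import matplotlib.pyplot as plt

    descriptors = np.random.rand(1000, 768)        # placeholder for the extracted descriptors
    labels = np.random.randint(0, 20, size=1000)   # placeholder department or cluster labels

    # Reduce the descriptors to two dimensions while preserving their structure as well as possible.
    embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(descriptors)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=2, cmap="tab20")
    plt.title("UMAP projection of the descriptors")
    plt.show()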

5 Experiments
The following sections describe the suggested method on the annotated dataset
from the Metropolitan Museum of Art (the MET) by first creating descriptors,
then performing clustering, and lastly evaluating the different outcomes. First, the
data was retrieved and pre-processed before it was further processed to descriptors.
In order to discover which model, the ResNet18 or the ViT, creates suitable high-
level representations of the artworks they were tested on their performance. This
entailed visualising the possible overlap between descriptors of different collections,
displaying the centroids, comparing the performance of the models after they were
fine-tuned on the MET, and calculating the Davies-Bouldin Indices and Silhouette
Coefficients.

5.1 Experimental Setup


5.1.1 Data
The annotated museum dataset used in this thesis was retrieved from the MET, which currently consists of about 400k images, each provided with metadata such as its title, object date, object name, department, etc. From all the metadata, only the departments were used in this research, of which there are twenty different variants; the distribution of departments can be seen in Table 1. The MET dataset contains multiple images of the same artwork; these duplicates were filtered out so that there is only one image per piece of art, which left 224408 unique artwork images.

Department                                     N
Drawings and Prints                        59572
Asian Art                                  29478
Greek and Roman Art                        29005
European Sculpture and Decorative Arts     27333
Islamic Art                                11897
Egyptian Art                               11229
The American Wing                           9974
Costume Institute                           7593
Arms and Armor                              6507
Medieval Art                                6353
Ancient Near Eastern Art                    5973
Arts of Africa, Oceania and the Americas    5702
Photographs                                 5620
European Paintings                          2246
Robert Lehman Collection                    2199
Musical Instruments                         1720
The Cloisters                               1686
Modern and Contemporary Art                  170
The Libraries                                 92
Other                                         59
Total                                     224408

Table 1: The department distribution of the artworks from the MET database.

5.1.2 Data Retrieval


The images of the MET dataset can be retrieved from the internet and have already been resized so that the largest side is 500 pixels. The ground truth labels are stored in a JSON file and consist of a "path" and an "id" variable, where "path" refers to the path to the image file and "id" refers to which specific object is depicted in the image.[1] The remaining metadata can be found in a different location and is conveniently stored in a CSV file.[2]

5.1.3 Data Pre-processing


To standardise the images before they went through the neural networks, they were pre-processed first. The first step was randomly cropping and resizing the images to 500 × 500 pixels so that each image had the same dimensions. First, the image was cropped with a lower bound scale of 0.7 and an upper bound scale of 1.0 with respect to the area of the original image. Next, this result was resized, with an aspect ratio between 0.99 and 1/0.99, to a size of 500 × 500. To center the data, each of the three colour channels of the input images was normalised by subtracting the channel means of the MET dataset and dividing by the corresponding standard deviations.
[1] Downloaded from http://cmp.felk.cvut.cz/met/#models
[2] From https://media.githubusercontent.com/media/metmuseum/openaccess/master/MetObjects.csv
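
A minimal torchvision sketch of this pre-processing pipeline is shown below; the channel means and standard deviations are placeholders, as the actual MET statistics are not listed in this thesis.

    from torchvision import transforms

    # Placeholder channel statistics; the real values are computed over the MET dataset.
    MET_MEAN = [0.5, 0.5, 0.5]
    MET_STD = [0.25, 0.25, 0.25]

    preprocess = transforms.Compose([
        # Random crop covering 70-100% of the original image area, with a near-square
        # aspect ratio, resized to 500 x 500 pixels.
        transforms.RandomResizedCrop(500, scale=(0.7, 1.0), ratio=(0.99, 1 / 0.99)),
        transforms.ToTensor(),
        # Normalise each colour channel with the dataset mean and standard deviation.
        transforms.Normalize(mean=MET_MEAN, std=MET_STD),
    ])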

5.1.4 Model Retrieval
The descriptors were extracted using two models: the ResNet18 and the ViT. The TorchVision implementation of the ResNet18 used in this thesis has 17 convolutional layers and 1 fully-connected layer, and was pre-trained on ImageNet (Deng et al., 2009). A Generalized-Mean pooling layer was added to the model, as it was deemed useful in instance-level recognition (Ypsilantis et al., 2022). The ViT-B_16 model[3] was pre-trained on the ImageNet-21K dataset (Ridnik et al., 2021) and was extracted from Google's official checkpoint. This model splits the input image into patches of 16 × 16 pixels and has 12 Transformer encoder blocks with a hidden size of 768, followed by a final prediction layer, which totals 43.3 million parameters.

5.2 Analysing Similarities


In this section, the MET data was explored using high-level feature representations extracted from the pre-trained models. The descriptors were created by removing the prediction layer from both models, so that the outputs of the ResNet18 and ViT are vectors of length 512 and 768 respectively. Then, these vectors were grouped not only by their predefined departments but also by using k-Means clustering with k = 20. Thereafter, to make these vectors more intuitive to interpret, they were reduced to 2D vectors using the UMAP projection and plotted. Lastly, images close to the centroids of both the department clusters and the k-Means clusters were displayed in order to get a better understanding of both the departments and the clusters.
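
A sketch of how such descriptors can be extracted in PyTorch is given below for the ResNet18: replacing the final fully-connected layer with an identity mapping turns the network into a 512-dimensional feature extractor. The GeM pooling layer and the ViT variant are handled analogously and are omitted here.

    import torch
    from torchvision import models

    # Load a ResNet18 pre-trained on ImageNet and drop its prediction layer.
    model = models.resnet18(pretrained=True)
    model.fc = torch.nn.Identity()  # the output is now the 512-dimensional descriptor
    model.eval()

    @torch.no_grad()
    def extract_descriptor(image_tensor):
        # image_tensor: a pre-processed image of shape (3, 500, 500).
        return model(image_tensor.unsqueeze(0)).squeeze(0)  # shape: (512,)

    descriptor = extract_descriptor(torch.rand(3, 500, 500))  # example input; real images come from the MET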

5.2.1 Results
Figure 4 shows the UMAP projection of the MET dataset after the images were
pre-processed and fed through the pre-trained ResNet18 and ViT networks. Both
the ResNet18 and ViT descriptors had quite some overlap between departments,
which means that art pieces from separate departments have visual similarities.
Likewise, the twenty k-Means clusters had some overlap, but remarkably less than the departments. The k-Means clusters were also more evenly distributed and more similar in size than the department clusters.
[3] Specific implementation at https://github.com/jeonsworld/ViT-pytorch

Figure 4: UMAP projection of the MET dataset, descriptors from the pre-trained ResNet18 and ViT. Clustering is done by departments and by k-Means with k = 20. Six images are enlarged, where their border indicates the real department. (a) and (b) are originally in two different departments, but both are pieces of clothing and end up in the same cluster after k-Means clustering. (c), a black and white drawing, and (d), a colour print, are in the same department, but in different clusters. Lastly, (e) and (f) are in different clusters, but visually very similar and originally in the same department.

For each of the departments, the centroid was calculated using the ResNet18 descriptors. Then, the four images that are closest in Euclidean distance to the centroid of each department are shown in Figure 5. This figure shows that images close to the centroid, and therefore also close to each other, might not be visually similar but do belong to the same department.

Figure 5: Four images that are closest to the centroid of each department, from the
ResNet18 model descriptors. Colour of border indicates which department is portrayed.
Images might not always be visually similar (e.g. see American Wing on the top left
side), but belong to the same department.

Then, the centroids for each of the twenty k-Means clusters were calculated, once again using the ResNet18 descriptors. Next, the four images that are closest in Euclidean distance to the centroid of each cluster are shown in Figure 6. Most clusters consisted of art pieces that were originally from different departments, which can be concluded from the different colour borders. However, there were also clusters that consisted of very similar images from the same department. For example, cluster 15 consisted of images only from the Drawings and Prints department, which were all depictions of humans. Furthermore, cluster 13 was also made up of images from Drawings and Prints and only depicted drawings of Native Americans. This could suggest that there are subcategories present in the departments and that the descriptors of these subcategories were placed close to each other.

Figure 6: Four images that are closest to the centroid of each k-Means cluster with k = 20, from the ResNet18 model descriptors. The colour of the border indicates which department the image originally is from. Images might not always be from the same department, but are visually similar (e.g. see cluster 17).

5.3 Training of Models


Both models were also trained further on the MET dataset, with the twenty departments used as classification labels. The dataset was randomly split into 80% training data and 20% testing data. A cross entropy loss and a stochastic gradient descent optimiser with a learning rate of 0.001 were used. Both models were trained for ten epochs and with a batch size larger than one. To make training more efficient, two hidden layers of size 512 were added to the ResNet18 model, each with a ReLU activation function, followed by a final classification layer of size 20 with a softmax activation function. Since partially freezing the ViT model was deemed useful in style identification (Khan and van Noord, 2021), this was also applied in this thesis.
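
The following sketch outlines this fine-tuning setup for the ResNet18 under the stated hyperparameters; the exact batch size, data loading, and ViT freezing scheme are not specified here and are therefore assumed. Note that PyTorch's CrossEntropyLoss applies the softmax internally, so the model outputs logits.

    import torch
    from torch import nn
    from torchvision import models

    NUM_DEPARTMENTS = 20

    # ResNet18 with two extra hidden layers of size 512 and a 20-way classification head.
    model = models.resnet18(pretrained=True)
    model.fc = nn.Sequential(
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, NUM_DEPARTMENTS),  # logits; the softmax is applied inside the loss
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    # Placeholder loader standing in for the 80% MET training split.
    train_loader = [(torch.rand(8, 3, 500, 500), torch.randint(0, NUM_DEPARTMENTS, (8,)))]

    for epoch in range(10):
        for images, departments in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), departments)
            loss.backward()
            optimizer.step()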

After fine-tuning the models, their classification ability was measured on the test
set by calculating their accuracy, macro F1-score, macro precision and macro recall.

Model Accuracy F1-score Precision Recall
ResNet18 0.264 0.021 0.013 0.05
ViT 0.767 0.610 0.674 0.584

Table 2: Department classification performance on test set of ResNet18 and ViT after
fine-tuning on the training MET dataset.

Furthermore, a UMAP projection was performed on descriptors extracted from both models, which were trained on the whole MET dataset, and the results were plotted.

5.3.1 Results
As seen in Table 2, the partially frozen ViT model outperformed the ResNet18 model on every metric. The ResNet18 model had an accuracy of 0.264, which is about five times better than simply guessing, but the other metrics showed a different picture. The chosen F1-score, precision, and recall metrics weighted each department equally, meaning that the scores of most departments were very low. The ViT model had an accuracy of 0.767, which is about fifteen times better than guessing, and achieved a reasonable score on the other metrics.

The UMAP projection of the descriptors after fine-tuning on the whole MET dataset is shown in Figure 7. The ViT model presented more distinct groups than the ResNet18 model, which could explain its better classification results.

Figure 7: UMAP projection of the MET dataset, descriptors from the pre-trained and
fine-tuned ResNet18 and ViT.

5.4 Performance Comparison
In this section, the descriptors extracted from each of the models, namely the pre-trained ResNet18 and ViT with and without fine-tuning, were clustered and compared on the Davies-Bouldin Score (DBS) and Silhouette Score (SS). Each set of descriptors was clustered using k-Means with a k of 20, 10, 6, 4, and 2. The fine-tuned models were trained on the entire MET dataset, thus including the 20% test data that was previously used to examine their classification ability. An overview of the results is given in Table 3 and they are discussed in the next section.
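
The comparison in Table 3 can be reproduced conceptually with the following sketch; the random array stands in for the descriptors of one model variant.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score, silhouette_score

    descriptors = np.random.rand(5000, 768)  # placeholder for the real descriptors

    for k in [20, 10, 6, 4, 2]:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(descriptors)
        dbs = davies_bouldin_score(descriptors, labels)
        # silhouette_score scales quadratically with the number of samples;
        # its sample_size argument can be used for very large descriptor sets.
        ss = silhouette_score(descriptors, labels)
        print(f"k = {k:2d}: DBS = {dbs:.2f}, SS = {ss:.2f}")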

5.4.1 Results
First, DBS and SS improved when the descriptors from the ResNet18 model pre-
trained on ImageNet were clustered using k -Means for every k in comparison to
the department groupings. However, the best scores were achieved when clustering
with k = 2.

When the ResNet18 model was further trained on the MET, the DBS of the department groupings did improve, but the SS did not. When clustering with k-Means, the DBS worsened for k = 20, 10, and 6 in comparison to the non-fine-tuned ResNet18 model. However, the DBS did improve for k = 4 and 2, and so did the SS for k = 6 and 2.

The descriptors from the ViT model pre-trained on ImageNet-21K performed similarly to those of the ResNet18 model, but often received slightly lower scores. However, when the ViT model was further trained on the MET dataset, it outperformed most other options. The SS of the department clusters even outperformed all other options with a score of 0.19.

The best DBS was achieved with descriptors from the ViT model trained on both
ImageNet-21K and MET, and k -Means clustering with k = 2. The best SS was
achieved with descriptors from the ViT model trained on both ImageNet-21K and
MET, and the department groupings.

So, it is apparent that the ViT model trained on both ImageNet-21K and MET
outperformed the other models, which is the reason why this model was chosen
for further testing.

Model     Train Data            Clustering    k    DBS     SS
ResNet18  ImageNet              Departments  20   8.04  -0.01
                                k-Means      20   3.55   0.05
                                k-Means      10   3.90   0.05
                                k-Means       6   3.75   0.05
                                k-Means       4   3.72   0.07
                                k-Means       2   3.12   0.09
ResNet18  ImageNet + MET        Departments  20   7.89  -0.01
                                k-Means      20   3.97   0.04
                                k-Means      10   4.03   0.05
                                k-Means       6   3.84   0.06
                                k-Means       4   3.17   0.07
                                k-Means       2   2.71   0.10
ViT       ImageNet-21K          Departments  20   8.59   0.00
                                k-Means      20   3.88   0.04
                                k-Means      10   4.09   0.04
                                k-Means       6   4.21   0.04
                                k-Means       4   4.49   0.04
                                k-Means       2   4.71   0.04
ViT       ImageNet-21K + MET    Departments  20   2.52   0.19
                                k-Means      20   2.63   0.12
                                k-Means      10   2.60   0.13
                                k-Means       6   3.01   0.17
                                k-Means       4   2.76   0.16
                                k-Means       2   2.01   0.13

Table 3: Clustering evaluation of the ResNet18 and ViT models with different training
data and clustering properties. DBS stands for Davies-Bouldin Score and SS stands for
Silhouette Score. Both scores are rounded to two decimals.

The pre-trained and fine-tuned ViT model with k = 2 performed best when the two scores are weighted equally. However, when observing Figure 8, it becomes clear that the two resulting clusters mainly separated the (black and white) Drawings and Prints from the other departments. This option did not properly show the diversity of the collection and was too global, which is why the second best performing option was examined, namely the fine-tuned ViT model with k = 10; Figure 9 illustrates the ten clusters that were created.

These ten clusters showed more similarities within each cluster and were quite different from each other. Therefore, the number of clusters chosen was ten, i.e. k-Means clustering with k = 10 was applied.

Figure 8: Fifteen images that are closest to each of the two centroids of k -Means clusters
with k = 2, from the fine-tuned ViT model descriptors.

Figure 9: Four images that are closest to each of the ten centroids of k -Means clusters
with k = 10, from the fine-tuned ViT model descriptors.

6 Case Study on Allard-Pierson
Experiments on the MET dataset showed that the pre-trained and fine-tuned ViT
neural network worked well for extracting high-level features from visual data
from a museum and that ten clusters gave a good representation. These properties
formed the foundation for the case study on data from the Allard-Pierson Museum.
This entailed passing the data through the model to extract descriptors, clustering
this data using k -Means with k = 10 and evaluating the outcome by showing the
UMAP projection and images from each cluster.

6.1 Data
The online Allard-Pierson Museum (the AP) dataset consists of 37626 unique images of objects, where some images have been provided with metadata. It is not clear from the dataset to which collection a particular artwork belongs, but according to the museum there are fourteen collections curated by experts.[4]

The data from the AP was retrieved from two different sources: the Beeldbank[5], which consists of all objects with a digital image, and the TIN[6], which is made up of the theatre collection pieces for which an image exists. A SPARQL query was needed to collect the data from the Beeldbank and retrieved 188733 entries, which reduced to 37300 unique image objects. Some entries in this data are provided with metadata such as a title, description, abstract, etc., but all objects have a link to their image, which was used to download the images. During the downloading process three more faulty image links were found, which were not downloaded, resulting in 37297 unique images.

The TIN dataset is stored as an XML file where each entry contains some metadata and a filename, which was used to retrieve the image from an online server. The Beeldbank and TIN datasets were merged into one JSON file. The whole dataset was then pre-processed in the same manner as the MET dataset (see Section 5.1.3).
[4] See https://allardpierson.nl/en/collecties/ for more information about the collections.
[5] See https://lod.uba.uva.nl/
[6] From https://servicetin.adlibhosting.com/te4/wwwopac.ashx?command=search&database=collectCUE&search=pointer%20108&output=xml&limit=400&startfrom=1&xmltype=grouped

6.2 Results
The UMAP projection of the descriptors is shown in Figure 10. There were distinct groups visible in the unclustered graph, which ideally would have been encapsulated by the clustering algorithm. This was partly reflected in the clustered graph; e.g. cluster 7 was a concentrated group of data points. However, in the concentrated area located in the left center of the plot, this desired behaviour did not occur. This area was divided into three clusters, namely clusters 1, 6 and 8, while they seemingly should have belonged to the same cluster since their descriptors were similar to each other. The division of this large area was probably caused by the number of clusters; since ten clusters had to be created, the algorithm converged to this partition. The AP collection is perhaps less diverse than the MET collection, and therefore fewer than ten distinct visual groups were present. This could be an argument against creating ten clusters for this museum collection and suggests that the number of clusters might be museum specific.

Figure 10: UMAP projection of the AP dataset, descriptors from the pre-trained and
fine-tuned ViT. Clustering is done using k -Means with k = 10.

To gain more insight into the created clusters, Figure 11 shows three types of images for each of the ten clusters. Each cluster has a centroid; descriptors of images close to this centroid are centrally located in the cluster, and descriptors far away from the centroid are on the edge of the cluster. So, in order to represent the entire cluster, Figure 11 shows three images of museum objects closest to the centroid, three furthest away, and three in between. Ideally, all images of the same cluster should have visual similarities: if the furthest and the closest images from the same cluster resemble each other, then the cluster is visually very similar. If the furthest and closest images do not resemble each other, the furthest images might be outliers that do not belong to that cluster.

For each artwork, the metadata, if available, was collected per cluster and cleaned up as much as possible, for example by removing stop words. For each cluster, the ten most relevant words were shown along with the images. The relevance of the words was calculated using term frequency–inverse document frequency (tf-idf). This value increases if a word appears often in a certain cluster and not in other clusters; a word with a high tf-idf score is of relevance for that cluster.
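
One possible way to compute these keywords, assuming each cluster's cleaned metadata has been concatenated into a single string, is to treat every cluster as one document and apply scikit-learn's TfidfVectorizer; the exact implementation used for this thesis is not specified, so this is only a sketch with placeholder texts.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One string per cluster containing its concatenated, cleaned metadata (placeholders).
    cluster_texts = ["pekidim amarkalim brief jeruzalem", "catalogus veiling boek omslagtitel"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(cluster_texts)   # shape: (number of clusters, vocabulary size)
    vocabulary = vectorizer.get_feature_names_out()

    for cluster_id, row in enumerate(tfidf.toarray()):
        top_words = [vocabulary[i] for i in row.argsort()[::-1][:10]]
        print(cluster_id, top_words)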

Figure 11: Nine images per cluster that are closest to, furthest from, and in between relative to the centroid of each k-Means cluster with k = 10. The words below each cluster are the ten most relevant words of that cluster.

The images of the AP mainly consisted of letters and books, which is why seven clusters consisted of this type. However, differences were visible between these clusters; clusters 1, 6 and 8 consisted more of letters, while clusters 2, 4, 5 and 9 consisted more of books. The similarity between clusters 1, 6 and 8 is confirmed in Figure 10, as they split up a concentrated area located in the left center of the plot. The words "pekidim" and "amarkalim" also occurred in all of these clusters, which refer to the Pekidim & Amarkalim archive of incoming letters mainly from Jerusalem. However, it could also be concluded from the images that cluster 1 consisted more of loosely written shorter letters and clusters 6 and 8 of more structured longer letters, which could be an argument to separate them.

Clusters 2 and 4 were positioned quite close to each other in the UMAP projection, but reasonably formed their own clusters, as cluster 2 primarily had pages with text on them while cluster 4 had blank pages. Cluster 4 also had some descriptors located further away from the main cluster, which explains the non-book images. Furthermore, clusters 5 and 9 mainly consisted of books, but their distinction from the other clusters and from each other was quite clearly their colour. Additionally, in clusters 2 and 4 words that are associated with auctions and catalogues occurred, and in cluster 5 the word "omslagtitel" (cover title) appeared, which were all visually reflected in some of the images.

The remaining three clusters were clusters 3, 7 and 10, and some of their relevant words could also be linked back to the images; the words "portret", "klei" and "botanie" (which translate to portrait, clay and botany respectively) were reflected in most of the images of their corresponding cluster. However, the clusters also contained images that are less obvious to a viewer; for example, there were woven materials in cluster 10, while the other images seemed to indicate that this cluster mainly consisted of botanical works of art. These kinds of findings show that interesting new connections are made between works of art that do not normally appear together, but are linked because of the similarity of their high-level features. However, these findings also show how difficult it is to cluster properly, because cluster 10 should perhaps have been split further.

7 Conclusion
To answer the research question, namely How can museum collections be grouped
in a data-driven manner?, this thesis used the annotated Metropolitan Museum
of Art dataset to test the performance of two neural networks, ResNet18 and ViT,
and the k-Means algorithm for different values of k. Applying transfer learning in the art domain proved to be useful, as it was possible to detect visual patterns in the images, which became apparent in the qualitative evaluation of the pre-trained ResNet18: images with similar patterns are arranged in the same cluster.

This thesis further shows that the classification accuracy of a partially frozen ViT is approximately three times better than that of the ResNet18, which means that the ViT is more proficient in distinguishing the MET museum departments. A
possible explanation for this behaviour is that the ViT was pre-trained on a larger
dataset and therefore had an advantage over the ResNet that was pre-trained on
a smaller dataset.

The pre-trained and fine-tuned ViT also proved superior during the examination of the different clustering metrics, and after a qualitative analysis of the clusters, this model combined with ten clusters was found to establish the best conditions for representing the entire museum collection.

The case study on the less annotated data from the Allard-Pierson Museum showed that when the data is not as diverse as the experimental dataset, multiple clusters may emerge that resemble each other. There might be slight differences between these similar clusters, and therefore new subgroups are created, but the number of clusters that best represents the entire museum collection is museum specific. So, the number of clusters should be optimised based on the dataset. Lastly, some of the created clusters are less obvious, so novel collections emerge in which objects that do not naturally occur together are now grouped.

In conclusion, a fine-tuned ViT model works well to extract high-level features in the art domain, and the number of clusters that gives a good representation of the entire museum collection is dataset dependent. So, museum collections can be grouped by extracting descriptors from their visual data and clustering these descriptors using the k-Means algorithm with a k that is optimised for that museum dataset. These conditions form a foundation for future work, in which different clustering algorithms can be experimented with, e.g. k-Medians (Bradley et al., 1996), which is less sensitive to outliers than k-Means, and in which the model can be trained on more museum data so that it performs better in the art domain. Lastly, the improvement in user-friendliness of navigating the created museum collections can be assessed by conducting a user study.

References
Arnold, T., Leonard, P., and Tilton, L. (2017). Knowledge creation
through recommender systems. Digital Scholarship in the Humanities,
32(Supplement_2):ii151–ii157.

Arnold, T. and Tilton, L. (2019). Distant viewing: Analyzing large visual corpora.
Digital Scholarship in the Humanities, 34(Supplement_1):i3–i16.

Bradley, P., Mangasarian, O., and Street, W. (1996). Clustering via concave
minimization. Advances in neural information processing systems, 9.

Cetinic, E. and She, J. (2021). Understanding and creating art with AI: Review and outlook.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Ima-
genet: A large-scale hierarchical image database. In 2009 IEEE Conference on
Computer Vision and Pattern Recognition, pages 248–255.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image
recognition at scale.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image
recognition.

Heath, K., Gelfand, N., Ovsjanikov, M., Aanjaneya, M., and Guibas, L. J. (2010).
Image webs: Computing and exploiting connectivity in image collections. In
2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 3432–3439. IEEE.

Jaffe, A., Naaman, M., Tassa, T., and Davis, M. (2006). Generating summaries
for large collections of geo-referenced photographs. In Proceedings of the 15th
international conference on World Wide Web, pages 853–854.

Jo, E. S. and Gebru, T. (2020). Lessons from archives: Strategies for collecting
sociocultural data in machine learning. In Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency, FAT* ’20, page 306–316, New
York, NY, USA. Association for Computing Machinery.

Karayev, S., Trentacoste, M., Han, H., Agarwala, A., Darrell, T., Hertzmann, A.,
and Winnemoeller, H. (2014). Recognizing image style. In Proceedings of the
British Machine Vision Conference. BMVA Press.

Kennedy, L. S. and Naaman, M. (2008). Generating diverse and representative image search results for landmarks. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, page 297–306, New York, NY, USA. Association for Computing Machinery.

Khan, S. and van Noord, N. (2021). Stylistic multi-task analysis of ukiyo-e wood-
block prints.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multi-
variate observations. In Proceedings of the fifth Berkeley symposium on mathe-
matical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.

McInnes, L., Healy, J., and Melville, J. (2018). Umap: Uniform manifold approx-
imation and projection for dimension reduction.

Milani, F. and Fraternali, P. (2021). A dataset and a convolutional model for iconography classification in paintings. Journal on Computing and Cultural Heritage, 14(4):1–18.

Rematas, K., Fernando, B., Dellaert, F., and Tuytelaars, T. (2015). Dataset
fingerprints: Exploring image collections through data mining. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4867–4875.

Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor, L. (2021). Imagenet-21k
pretraining for the masses.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Sinha, P., Mehrotra, S., and Jain, R. (2011). Summarization of personal photologs
using multidimensional content and context. In Proceedings of the 1st ACM
International Conference on Multimedia Retrieval, pages 1–8.

Snavely, N., Seitz, S. M., and Szeliski, R. (2006). Photo tourism: exploring photo
collections in 3d. In ACM siggraph 2006 papers, pages 835–846.

Van Leuken, R. H., Garcia, L., Olivares, X., and van Zwol, R. (2009). Visual
diversification of image search results. In Proceedings of the 18th international
conference on World wide web, pages 341–350.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, L., and Polosukhin, I. (2017). Attention is all you need.

Whitelaw, M. (2015). Generous interfaces for digital cultural collections. Digital Humanities Quarterly, 9(1).

Ypsilantis, N.-A., Garcia, N., Han, G., Ibrahimi, S., Van Noord, N., and Tolias,
G. (2022). The met dataset: Instance-level recognition for artworks.

Yuan, J., Wu, Y., and Yang, M. (2007). Discovery of collocation patterns: from
visual words to visual phrases. In 2007 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–8. IEEE.

