
Domino Admin Docs Documentation

Release 4.4.0

Domino Data Lab

Mar 05, 2021


Contents

1 About Domino 4
2 Architecture
   2.1 Overview
   2.2 Services
   2.3 Software
   2.4 User accounts
   2.5 Service mesh
3 Kubernetes
   3.1 Cluster requirements
   3.2 Requirements checker
   3.3 Domino on EKS
   3.4 Domino on GKE
   3.5 Domino on AKS
   3.6 Domino on OpenShift
   3.7 NVIDIA DGX in Domino
   3.8 Domino in Multi-Tenant Kubernetes Cluster
   3.9 Encryption in transit
   3.10 Compatibility
4 Installation
   4.1 Installation process
   4.2 Configuration Reference
   4.3 Installer configuration examples
   4.4 Private or offline installation
   4.5 fleetcommand-agent release notes
5 Configuration
   5.1 Central Configuration
   5.2 Change the default project for new users
   5.3 Project stage configuration
   5.4 Domino integration with Atlassian Jira
6 Compute
   6.1 Managing the Compute Grid
   6.2 Hardware Tier best practices
   6.3 Model resource quotas
   6.4 Persistent volume management
   6.5 Adding a node pool to your Domino cluster
   6.6 Removing a node from service
7 Keycloak authentication service
   7.1 Accessing the Keycloak UI
   7.2 Local username and password configuration
   7.3 LDAP / AD federation
   7.4 Single Sign-On configuration
8 Operations
   8.1 Domino application logging
   8.2 Domino monitoring
   8.3 Sizing infrastructure for Domino
9 Data management
   9.1 Data in Domino
   9.2 Data flow in Domino
   9.3 External Data Volumes
   9.4 Datasets administration
   9.5 Submitting GDPR requests
10 User management
   10.1 Roles
   10.2 License usage reporting
11 Environments
   11.1 Environment management best practices
   11.2 Caching environment images in EKS
12 Disaster recovery
   12.1 Backing up Domino
13 Control Center
   13.1 Control Center overview
   13.2 Exporting Control Center data with the API

This guide describes how to install, operate, administer, and configure the Domino application in your own Kubernetes
cluster. This content is applicable to Domino users with self-installation licenses.
If you are interested in running Domino as a managed service in your cloud or in a single-tenant vendor cloud, contact
Domino. Managed service customers will have installation, operations, and administration handled via professional
services, and the content of this guide will not be required or applicable.

CHAPTER 1

About Domino 4

Domino is a data science platform that enables fast, reproducible, and collaborative work on data products like models,
dashboards, and data pipelines. Users can run regular jobs, launch interactive notebook sessions, view vital metrics,
share work with collaborators, and communicate with their colleagues in the Domino web application.

All Domino components run in Kubernetes. You can run an instance of Domino in the cloud or on-premises in your
office or data center.
Use the links in the sidebar to learn more about the Domino architecture and supported Kubernetes clus-
ter configurations. If you need help setting up a Domino-compatible Kubernetes cluster, send an email to
sales@dominodatalab.com.



CHAPTER 2

Architecture

The diagram below shows the physical infrastructure of Domino 4.


Domino runs in a Kubernetes cluster with a standard set of three master nodes, a set of worker nodes dedicated to
hosting Domino platform services, and a set of worker nodes dedicated to hosting compute workloads. Outside the
cluster are a durable blob storage system and a load balancer that routes connections from users.


2.1 Overview

The Domino application hosts two major workloads:


1. Domino Platform
These components provide user interfaces, the Domino API server, orchestration, metadata and supporting
services.
2. Domino Compute
This is where users’ data science, engineering, and machine learning workflows are executed.
All workloads in the Domino application run as containerized processes, orchestrated by Kubernetes. Kubernetes is
an industry-standard container orchestration system. Kubernetes was launched by Google and has broad community
and vendor support, including managed offerings from all major cloud providers.
Typically, Domino customers will provision and manage their own Kubernetes cluster into which they install Domino.
Domino offers professional services for customers who require assistance provisioning a cluster. Please talk to your
account executive for more information about these options.

2.2 Services

Domino services are best understood when arranged into logical layers based on function and communication. A
description of the functionality provided by each layer follows.


2.2.1 Client layer

The client layer contains the Frontend pods that are the targets of a network load balancer. Domino users can access
Domino’s core features by connecting to the Frontends via:
• Web browser, in which case the Frontend serves the Domino application
• HTTPS request to the Domino API, which the Frontend routes to the API server
• Domino CLI, which uses the API
The Frontends run on platform nodes.

2.2.2 Service layer

The service layer contains the Domino API server, Dispatcher, Keycloak authentication service, and the metadata
services that Domino uses to provide reproducibility and collaboration features. MongoDB stores application object
metadata, Git manages code and file versioning, Elasticsearch powers in-app search, and the Docker registry is used
by Domino Environments. Project data, logs, and backups are written to durable blob storage.
All of these services run on platform nodes.
The service layer also contains the dedicated master nodes for the Kubernetes cluster.

2.2.3 Execution layer

The execution layer is where Domino will launch and manage ephemeral pods that run user workloads. These may
host Jobs, Model APIs, Apps, Workspaces, and Docker image builds.
These run on compute nodes.


2.3 Software

The Domino platform runs or depends on the following software components.

2.3.1 Application services

The following primary application services run on platform nodes in the Domino Kubernetes cluster.
• nginx
nginx is an open source HTTP and reverse proxy server. Domino uses NGINX to serve the Domino web
application and as a reverse proxy to route requests to internal services.
Learn more about nginx
• Domino API server
The Domino application exposes the Domino API and handles REST API requests from the web application
and user clients.
• Domino dispatcher
The Domino dispatcher handles orchestration of workloads on compute nodes. The dispatcher launches new
compute pods, connects results telemetry back to the Domino application, and monitors the health of running
workloads.
• Keycloak
Keycloak is an enterprise-grade open source authentication service. Domino uses Keycloak to store user iden-
tities and properties, and optionally for identity brokering or identity federation to SSO systems and identity
providers.
Keycloak supports the following protocols:
– SAML v2.0
– OpenID Connect v1.0
– OAuth v2.0
– LDAP(S)
Learn more about Keycloak

2.3.2 Supporting services

These metadata, communication, and processing services run on platform nodes.


• MongoDB
MongoDB is an open source document database. Domino uses MongoDB to store Domino entities, like projects,
users, and organizations. Domino stores the structure of these entities in MongoDB, but underlying data is stored
separately in encrypted blob storage.
Learn more about MongoDB
• Git
Git is a free and open source distributed version control system. Domino uses Git internally for revisioning
projects and files. Domino Executors also run Git clients, and they can interact with user-controlled external
repositories to access code or data.
Learn more about Git
• Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine. Domino uses Elasticsearch to power user
searches for Domino objects like projects, files, and models. Domino also uses Elasticsearch for logging.
Learn more about Elasticsearch
• Docker registry
The Docker registry is an application used to store and distribute Docker images. Domino uses its registry
to store images for Domino environments and Model APIs. These images are built to user specifications by
compute nodes.
Learn more about Docker registry
• Fluentd
Fluentd is an open source application that unifies and processes logging and telemetry data. Domino uses
Fluentd to aggregate logs and forward data to durable storage.
Learn more about Fluentd
• Redis
Redis is an open source data structure cache. Domino uses Redis to cache logs in-memory for streaming back
to users through the web application.
Learn more about Redis
• RabbitMQ
RabbitMQ is an open source message broker. Domino uses RabbitMQ as an event bus to asynchronously
distribute event messages between Domino services.
Learn more about RabbitMQ
• Postgres
Postgres is an open source relational database system. Domino uses Postgres as a storage system for Keycloak
data on user identities and attributes.
Learn more about Postgres


2.4 User accounts

Domino uses Keycloak to manage user accounts. Keycloak supports the following modes of authentication to Domino.

2.4.1 Local accounts

When using local accounts, anyone with network access to the Domino application may create a Domino account.
Users supply a username, password, and email address on the signup page to create a Domino-managed account.
Domino administrators can track, manage, and deactivate these accounts through the application. Domino can be
configured with multi-factor authentication and password requirements through Keycloak.
Learn more about Keycloak administration

2.4.2 Identity federation

Keycloak can be configured to integrate with an Active Directory (AD) or LDAP(S) identity provider (IdP). When
identity federation is enabled, local account creation is disabled and Keycloak will authenticate users against identities
in the external IdP and retrieve configurable properties about those users for Domino usernames and email addresses.
Learn more about Keycloak identity federation

2.4.3 Identity brokering

Keycloak can be configured to broker authentication between Domino and an external authentication or SSO system.
When identity brokering is enabled, Domino will redirect users in the authentication flow to a SAML, OAuth, or OIDC
service for authentication. Following authentication in the external service, the user is routed back to Domino with a
token containing user properties.
Learn more about Keycloak identity brokering

2.5 Service mesh

A service mesh provides a transparent and language-independent way to flexibly and easily automate application
network functions such as traffic routing, load balancing, observability, and encryption. Domino can optionally
deploy or integrate with Istio, an open source service mesh; Domino requires Istio 1.7.2 or later. Istio is required
to implement intra-cluster encryption in transit.
Learn more about Istio
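
To make this concrete, the sketch below is a generic, hypothetical Istio PeerAuthentication policy of the kind a service mesh enforces for encryption in transit. The namespace name is a placeholder, and this is not the configuration the Domino installer manages.

# Hypothetical example: require mutual TLS for all workloads in one namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: example-namespace   # placeholder namespace
spec:
  mtls:
    mode: STRICT                 # reject plaintext pod-to-pod traffic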

CHAPTER 3

Kubernetes

Domino 4 runs in your Kubernetes cluster, and the infrastructure can be managed with Kubernetes native tools like
kubectl.

3.1 Cluster requirements

You can deploy Domino 4 into a Kubernetes cluster that meets the following requirements.

3.1.1 General requirements

• Kubernetes 1.13+
• Cluster permissions
Domino needs permission to install and configure pods in the cluster via Helm. The Domino installer is delivered
as a containerized Python utility that operates Helm through a kubeconfig that provides service account access
to the cluster.
• Three namespaces
Domino creates three dedicated namespaces, one for Platform nodes, one for Compute nodes, and one for
installer metadata and secrets.


3.1.2 Storage requirements

Storage classes

Domino requires at least two storage classes.


1. Dynamic block storage
Domino requires high performance block storage for the following types of data:
• Ephemeral volumes attached to user execution
• High performance databases for Domino application object data
This storage needs to be backed by a storage class with the following properties:
• Supports dynamic provisioning
• Can be mounted on any node in the cluster
• SSD-backed recommended for fast I/O
• Capable of provisioning volumes of at least 100GB
• Underlying storage provider can support ReadWriteOnce semantics
By default, this storage class is named dominodisk.
In AWS, EBS is used to back this storage class. Consult this example configuration for a compatible EBS
storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: domino-compute-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4

In GCP, compute engine persistent disks are used to back this storage class. Consult this example configuration
for a compatible GCEPD storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dominodisk
parameters:
  replication-type: none
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer


2. Long term shared storage


Domino needs a separate storage class for long term storage for:
• Project data uploaded or created by users
• Domino Datasets
• Docker images
• Domino backups
This storage needs to be backed by a storage class with the following properties:
• Dynamically provisions Kubernetes PersistentVolume
• Can be accessed in ReadWriteMany mode from all nodes in the cluster
• Uses a VolumeBindingMode of Immediate
In AWS, these storage requirements are handled by two separate classes. One backed by EFS for Domino
Datasets, and one backed by S3 for project data, backups, and Docker images.
In GCP, these storage requirements are handled by a Cloud Filestore volume mounted as NFS.
By default, this storage class is named dominoshared.
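
As an illustration of how this class is consumed, the following is a minimal, hypothetical PersistentVolumeClaim that requests a ReadWriteMany volume from a class named dominoshared; the claim name and size are placeholders rather than values Domino itself uses.

# Hypothetical claim against the shared storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-shared-claim     # placeholder name
spec:
  accessModes:
    - ReadWriteMany              # must be mountable from all nodes in the cluster
  storageClassName: dominoshared
  resources:
    requests:
      storage: 100Gi             # placeholder size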

Native

For shared storage, Domino allows (and in some cases requires) the use of the native cloud provider object store for a few resources and services:
• Blob Storage. For AWS, the blob storage must be backed by S3 (see Blob storage). For other infrastructure,
the dominoshared storage class is used.
• Logs. For AWS, the log storage must be backed by S3 (see Blob storage). For others, the dominoshared
storage class is used.
• Backups. For all supported cloud providers, backups are stored in the native blob store. For on-prem, backups
are backed by the dominoshared storage class.
– AWS: S3
– Azure: Azure Blob Storage
– GCP: GCP Cloud Storage
• Datasets. For AWS, Datasets storage must be backed by EFS (see Datasets storage). For other infrastructure,
the dominoshared storage class is used.

On-Prem

In on-prem environments, both dominodisk and dominoshared can be backed by NFS. In some cases, host
volumes can be used and may even be preferred. Host volumes are preferred for Git, Postgres, and MongoDB,
since Postgres and MongoDB provide their own state replication. Host volumes can also be used for Runs, but this
is not preferred, because Domino benefits from files cached in block storage that can move between nodes. If host
volumes are used for Runs, file caching should be disabled, and you should expect potentially slow startup for
executions of large Projects.
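
For illustration only, the sketch below shows a generic hostPath-backed PersistentVolume of the kind meant by "host volume" above. The name, capacity, and path are placeholders; the actual manifests used in a Domino deployment are managed by the installer.

# Hypothetical host volume: storage pinned to a single node's local disk.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-host-volume          # placeholder name
spec:
  capacity:
    storage: 500Gi                   # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /domino/host-data          # placeholder path on the node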


Summary

The storage requirements can be summarized as follows.
• dominodisk (dynamic block storage): dynamically provisioned ReadWriteOnce volumes for user executions and application databases; backed by EBS on AWS, GCE persistent disks on GCP, and NFS or host volumes on-prem.
• dominoshared (long term shared storage): ReadWriteMany storage for project data, Domino Datasets, Docker images, and backups; backed by EFS and S3 on AWS, Cloud Filestore on GCP, and NFS on-prem.

3.1.3 Node pool requirements

Domino requires a minimum of two node pools, one to host the Domino Platform and one to host Compute workloads.
Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.
1. Platform pool requirements
• Boot Disk: 128GB
• Min Nodes: 3
• Max Nodes: 3
• Spec: 8 CPU / 32GB
• Labels: dominodatalab.com/node-pool: platform
• Tags:
  – kubernetes.io/cluster/{{ cluster_name }}: owned
  – k8s.io/cluster-autoscaler/enabled: true #Optional for autodiscovery
  – k8s.io/cluster-autoscaler/{{ cluster_name }}: owned #Optional for autodiscovery

2. Compute pool requirements
• Boot Disk: 400GB
• Recommended Min Nodes: 1
• Max Nodes: Set as necessary to meet demand and resourcing needs
• Recommended min spec: 8 CPU / 32GB
• Enable Autoscaling: Yes
• Labels: domino/build-node: true, dominodatalab.com/node-pool: default
• Tags:
  – k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool: default
  – kubernetes.io/cluster/{{ cluster_name }}: owned
  – k8s.io/cluster-autoscaler/node-template/label/domino/build-node: true
  – k8s.io/cluster-autoscaler/enabled: true #Optional for autodiscovery
  – k8s.io/cluster-autoscaler/{{ cluster_name }}: owned #Optional for autodiscovery

3. Optional GPU compute pool
• Boot Disk: 400GB
• Recommended Min Nodes: 0
• Max Nodes: Set as necessary to meet demand and resourcing needs
• Recommended min Spec: 8 CPU / 16GB / one or more NVIDIA GPU devices
• Nodes must be pre-configured with the appropriate NVIDIA driver and nvidia-docker2, with the default Docker runtime set to nvidia. For example, use the EKS GPU-optimized AMI.
• Labels: dominodatalab.com/node-pool: default-gpu, nvidia.com/gpu: true
• Tags:
  – k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool: default-gpu
  – kubernetes.io/cluster/{{ cluster_name }}: owned
  – k8s.io/cluster-autoscaler/enabled: true #Optional for autodiscovery
  – k8s.io/cluster-autoscaler/{{ cluster_name }}: owned #Optional for autodiscovery

3.1.4 Cluster networking

Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. Network
policies are implemented by the network plugin, so your cluster must use a networking solution that supports
NetworkPolicy, such as Calico.
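
For context, the sketch below is a generic, hypothetical NetworkPolicy of the kind such a plugin enforces; the namespace, labels, and port are placeholders and do not reflect the policies Domino installs.

# Hypothetical policy: only pods labeled app=frontend may reach app=api on TCP 80.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-allow-frontend    # placeholder name
  namespace: example-namespace    # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 80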


3.1.5 Ingress and SSL

Domino will need to be configured to serve from a specific FQDN, and DNS for that name should resolve to the address
of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on
ports 80 and 443 to port 80 on all nodes in the Platform pool.
Health checks for this load balancer should use HTTP on port 80 and check for 200 responses from a path of /health
on the nodes.

3.2 Requirements checker

The Domino Cluster Requirements Checker is a command-line utility that checks if a Kubernetes cluster conforms
to Domino requirements. The Cluster Requirements Checker is a plugin for Sonobuoy, a Kubernetes diagnostic tool.
The instructions on this page are used to run only the Domino plugin, and not the full Kubernetes conformance suite.
The Cloud Native Compute Foundation has certified many Kubernetes offerings. Kubernetes certification steps include
conformance tests run by Sonobuoy. Domino uses the Sonobuoy Plugin Framework to perform customized Domino
conformance checks on a cluster prior to installing Domino.

3.2.1 Instructions

You should perform the following steps from a workstation with kubectl admin access to the target cluster.
1. Install Sonobuoy binaries

• If the cluster is running Kubernetes 1.13, install sonobuoy v0.15.4.
• If the cluster is running Kubernetes 1.14, install sonobuoy v0.16.2.
• If the cluster is running Kubernetes 1.15 or above, install the latest sonobuoy release.

Run the following command to determine the Kubernetes version for your cluster:

kubectl version

2. Set a KUBECONFIG environment variable to a path to a kubeconfig file with admin access to the target cluster.

export KUBECONFIG=~/.kube/config

3. Create a domino-checker.yaml configuration file with the following contents. You can download this file
from GitHub here


sonobuoy-config:
  driver: DaemonSet
  plugin-name: domino
  result-format: junit
  skip-cleanup: true

spec:
  env:
  - name: DOCKER_API_VERSION
    value: '1.38'
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: RESULTS_DIR
    value: /tmp/results
  image: quay.io/domino/k8s-validator:latest
  imagePullPolicy: Always
  name: domino
  securityContext:
    privileged: false
  volumeMounts:
  - mountPath: /tmp/results
    name: results
    readOnly: false
  - mountPath: /var/run/docker.sock
    name: docker-mnt
    readOnly: false

extra-volumes:
- name: docker-mnt
  hostPath:
    path: /var/run/docker.sock

4. Run the plugin.

1. sonobuoy run -p domino-checker.yaml --wait


2. resultsfile=$(sonobuoy retrieve)
3. sonobuoy results $resultsfile --plugin domino
4. sonobuoy delete --wait

The last instruction is necessary to remove sonobuoy. You must do this step if you want to run Sonobuoy on the cluster again.


3.2.2 Output example

validator> sonobuoy run -p domino-checker.yaml --wait
WARN[0001] Version v1.14.7-gke.14 is not a stable version, conformance image may not exist upstream
INFO[0002] created object    name=sonobuoy namespace= resource=namespaces
INFO[0002] created object    name=sonobuoy-serviceaccount namespace=sonobuoy resource=serviceaccounts
INFO[0002] created object    name=sonobuoy-serviceaccount-sonobuoy namespace= resource=clusterrolebindings
INFO[0002] created object    name=sonobuoy-serviceaccount namespace= resource=clusterroles
INFO[0002] created object    name=sonobuoy-config-cm namespace=sonobuoy resource=configmaps
INFO[0002] created object    name=sonobuoy-plugins-cm namespace=sonobuoy resource=configmaps
INFO[0002] created object    name=sonobuoy namespace=sonobuoy resource=pods
INFO[0002] created object    name=sonobuoy-master namespace=sonobuoy resource=services

validator> theFile=$(~/bin/sonobuoy retrieve)

validator> sonobuoy results $theFile --plugin domino
Plugin: domino
Status: failed
Total: 8
Passed: 6
Failed: 2
Skipped: 0

Failed tests:
Node CPU
Node Memory

validator> sonobuoy delete --wait
INFO[0000] deleted    kind=namespace namespace=sonobuoy
INFO[0000] deleted    kind=clusterrolebindings
INFO[0000] deleted    kind=clusterroles


3.2.3 Getting more details on failures

Run the following command to get more information about failed checks.
sonobuoy results $resultsfile --plugin domino --mode=dump
The output will look like this.
name: domino
status: failed
Items:
- name: gke-etienne-gke-1-build-13b06f55-8f2l
  status: failed
  Items:
  - name: domino-junit.xml
    status: failed
    Meta:
      file: results/gke-etienne-gke-1-build-13b06f55-8f2l/domino-junit.xml
    Items:
    - name: Domino Sonobuoy K8s Conformance Plugin
      status: failed
      Items:
      - name: RWX Storage Class Available
        status: passed
      - name: Default Storage Class Set
        status: passed
      - name: Helm (Tiller) Service does not exist
        status: passed
      - name: Node Labels
        status: passed
      - name: Node CPU
        status: failed
        Details:
          failure: Insufficient 24 required but only 8 of 24 available for Domino
      - name: Node Memory
        status: failed
        Details:
          failure: Insufficient 96Gi required but only 30880736Ki of 92642208Ki available for Domino
      - name: 'Docker Daemon Available: 4.14.145+'
        status: passed
- name: gke-etienne-gke-1-compute-a5dfc474-g5s4
  status: passed
  Items:
  - name: domino-junit.xml
    status: passed
    Meta:
      file: results/gke-etienne-gke-1-compute-a5dfc474-g5s4/domino-junit.xml
    Items:
    - name: Domino Sonobuoy K8s Conformance Plugin
      status: passed
- name: gke-etienne-gke-1-platform-a70f6fe2-fcss
  status: passed
  Items:
  - name: domino-junit.xml
    status: passed
    Meta:
      file: results/gke-etienne-gke-1-platform-a70f6fe2-fcss/domino-junit.xml
    Items:
    - name: Domino Sonobuoy K8s Conformance Plugin
      status: passed

3.3 Domino on EKS

Domino 4 can run on a Kubernetes cluster provided by AWS Elastic Kubernetes Service. When running on EKS, the
Domino 4 architecture uses AWS resources to fulfill the Domino cluster requirements as follows:

• Kubernetes control moves to the EKS control plane with managed Kubernetes masters
• Domino uses a dedicated Auto Scaling Group (ASG) of EKS workers to host the Domino platform
• ASGs of EKS workers host elastic compute for Domino executions
• AWS S3 is used to store user data, internal Docker registry, backups, and logs
• AWS EFS is used to store Domino Datasets
• The kubernetes.io/aws-ebs provisioner is used to create persistent volumes for Domino executions
• Calico is used as a network plugin to support Kubernetes network policies
• Domino cannot be installed on EKS Fargate, since Fargate does not support stateful workloads with persistent
volumes.
• Instead of EKS Managed Node groups, Domino recommends creating custom node groups to allow for additional
control and customized Amazon Machine Images. Domino recommends eksctl, Terraform, or CloudFormation
for setting up custom node groups.


All nodes in such a deployment have private IPs, and internode traffic is routed by an internal load balancer. Nodes in the
cluster can optionally have egress to the Internet through a NAT gateway.

3.3.1 Setting up an EKS cluster for Domino

This section describes how to configure an Amazon EKS cluster for use with Domino.

VPC networking

If you plan to do VPC peering or set up a site-to-site VPN connection to connect your cluster to other resources like
data sources or authentication services, be sure to configure your cluster VPC accordingly to avoid any address space
collisions.

Namespaces

No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:

• platform: Durable Domino application, metadata, platform services required for platform operation
• compute: Ephemeral Domino execution pods launched by user actions in the application
• domino-system: Domino installation metadata and secrets

Node pools

The EKS cluster must have at least two ASGs that produce worker nodes with the following specifications and distinct
node labels, and it may include an optional GPU pool:


Pool: platform
• Min-Max: 3-3
• Instance: m5.2xlarge
• Disk: 128G
• Labels: dominodatalab.com/node-pool: platform

Pool: default
• Min-Max: 1-20
• Instance: m5.2xlarge
• Disk: 400G
• Labels: dominodatalab.com/node-pool: default, domino/build-node: true

Pool: default-gpu (optional)
• Min-Max: 0-5
• Instance: p3.2xlarge
• Disk: 400G
• Labels: dominodatalab.com/node-pool: default-gpu, nvidia.com/gpu: true

The platform ASG can run in 1 availability zone or across 3 availability zones. If you want Domino to run with
some components deployed as highly available ReplicaSets you must use 3 availability zones. Using 2 zones is not
supported, as it results in an even number of nodes in a single failure domain. Note that all compute node pools you
use should have corresponding ASGs in any AZ used by other node pools. Setting up an isolated node pool in one
zone can cause volume affinity issues.
To run the default and default-gpu pools across multiple availability zones, you will need duplicate ASGs in
each zone with the same configuration, including the same labels, to ensure pods are delivered to the zone where the
required ephemeral volumes are available.
The easiest way to get suitable drivers onto GPU nodes is to use the EKS-optimized AMI distributed by Amazon as
the machine image for the GPU node pool.
Additional ASGs can be added with distinct dominodatalab.com/node-pool labels to make other instance
types available for Domino executions. Read Managing the Domino compute grid to learn how these different node
types are referenced by label from the Domino application.

Network plugin

Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. Network
policies are implemented by the network plugin, so your cluster must use a networking solution that supports
NetworkPolicy, such as Calico.
Refer to the AWS documentation on installing Calico for your EKS cluster.
If you use the Amazon VPC CNI for networking, with only NetworkPolicy enforcement components of Calico, you
should ensure the subnets you use for your cluster have CIDR ranges of sufficient size, as every deployed pod in the
cluster will be assigned an elastic network interface and consume a subnet address. Domino recommends at least a /23
CIDR for the cluster.
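
As a hedged sketch, an eksctl ClusterConfig can reserve address space when the cluster is created; the CIDR below is a placeholder chosen only to illustrate allocating more than the /23 minimum, and the fragment would sit alongside the metadata in the sample configurations later in this section.

# Hypothetical eksctl ClusterConfig fragment: reserve ample address space for pod IPs.
vpc:
  cidr: "10.50.0.0/16"    # placeholder; Domino recommends at least a /23 for the cluster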

Docker bridge

By default, AWS AMIs do not have bridge networking enabled for Docker containers. Domino requires this for
environment builds. Add --enable-docker-bridge true to the user data of the launch configuration used by


all Domino ASG nodes.


1. Create a copy of the launch configuration used by each Domino ASG.
2. Open the User data field and add --enable-docker-bridge true to the copied launch configuration.
3. Switch the Domino ASGs to use the new launch configuration.
4. Drain any existing nodes in the ASG.
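
If you manage node groups with eksctl rather than raw launch configurations, the sample cluster configurations later in this section accomplish the same thing with preBootstrapCommands that rewrite /etc/docker/daemon.json. The fragment below is taken from those samples and belongs under each compute nodeGroup.

# nodeGroup fragment from the sample configurations: enable the Docker bridge.
preBootstrapCommands:
  - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
  - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
  - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
  - "systemctl restart docker"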

Dynamic block storage

The EKS cluster must be equipped with an EBS-backed storage class that Domino will use to provision ephemeral
volumes for user execution. Consult the following storage class specification as an example.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: domino-compute-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4

Datasets storage

In order to store Datasets in Domino, an EFS (Elastic File System) must be configured. The EFS file system must be
provisioned and an access point configured to allow access from the EKS cluster.
Configure the access point with the following key parameters.
• Root directory path: /domino
• User ID: 0
• Group ID: 0
• Owner user ID: 0
• Owner group ID: 0
• Root permissions: 777


Record the file system and access point IDs for use when installing Domino.


Blob storage

When running in EKS, Domino can use Amazon S3 for durable object storage.
Create the following four S3 buckets:
• 1 bucket for user data
• 1 bucket for internal Docker registry
• 1 bucket for logs
• 1 bucket for backups
Configure each bucket to permit read and write access from the EKS cluster. This involves applying an IAM policy to
the nodes in the cluster like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::$your-logs-bucket-name",
        "arn:aws:s3:::$your-backups-bucket-name",
        "arn:aws:s3:::$your-user-data-bucket-name",
        "arn:aws:s3:::$your-registry-bucket-name"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::$your-logs-bucket-name/*",
        "arn:aws:s3:::$your-backups-bucket-name/*",
        "arn:aws:s3:::$your-user-data-bucket-name/*",
        "arn:aws:s3:::$your-registry-bucket-name/*"
      ]
    }
  ]
}

Record the names of these buckets for use when installing Domino.


Autoscaling access

If you intend to deploy the Kubernetes Cluster Autoscaler in your cluster, the instance profile used by your platform
nodes must have the necessary AWS Auto Scaling permissions.
See the following example policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceTypes"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

Domain

Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino.

Checking your EKS cluster

If you’ve applied the configurations described above to your EKS cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.


3.3.2 Sample cluster configuration

See below for a sample YAML configuration file you can use with eksctl, the official EKS command line tool, to create
a Domino-compatible cluster.
Note that after creating a cluster with this configuration, you must still create the EFS and S3 storage systems and
configure them for access from the cluster as described above.

# $LOCAL_DIR/cluster.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: domino-test-cluster
  region: us-west-2

nodeGroups:
  - name: domino-platform
    instanceType: m5.2xlarge
    minSize: 3
    maxSize: 3
    desiredCapacity: 3
    volumeSize: 128
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "platform"
    tags:
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-default
    instanceType: m5.2xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    volumeSize: 400
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "default"
      "domino/build-node": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default"
      "k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>
    preBootstrapCommands:
      - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
      - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
      - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
      - "systemctl restart docker"

  - name: domino-gpu
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 5
    volumeSize: 400
    availabilityZones: ["us-west-2a"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

For more information on autodiscovery see our Configuration Reference

3.3.3 Sample cluster configuration for multiple AZ

See below for a sample YAML configuration file you can use with eksctl, the official EKS command line tool, to create
a Domino-compatible cluster spanning multiple availability zones. Note that in order to avoid issues with execution
volume affinity, you must create duplicate groups in each AZ.
# $LOCAL_DIR/cluster.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: domino-test-cluster
  region: us-west-2

nodeGroups:
  - name: domino-platform-a
    instanceType: m5.2xlarge
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 128
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "platform"
    tags:
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-platform-b
    instanceType: m5.2xlarge
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 128
    availabilityZones: ["us-west-2b"]
    labels:
      "dominodatalab.com/node-pool": "platform"
    tags:
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-platform-c
    instanceType: m5.2xlarge
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 128
    availabilityZones: ["us-west-2c"]
    labels:
      "dominodatalab.com/node-pool": "platform"
    tags:
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-default-a
    instanceType: m5.2xlarge
    minSize: 0
    maxSize: 3
    volumeSize: 400
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "default"
      "domino/build-node": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default"
      "k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>
    preBootstrapCommands:
      - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
      - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
      - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
      - "systemctl restart docker"

  - name: domino-default-b
    instanceType: m5.2xlarge
    minSize: 0
    maxSize: 3
    volumeSize: 400
    availabilityZones: ["us-west-2b"]
    labels:
      "dominodatalab.com/node-pool": "default"
      "domino/build-node": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default"
      "k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>
    preBootstrapCommands:
      - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
      - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
      - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
      - "systemctl restart docker"

  - name: domino-default-c
    instanceType: m5.2xlarge
    minSize: 0
    maxSize: 3
    volumeSize: 400
    availabilityZones: ["us-west-2c"]
    labels:
      "dominodatalab.com/node-pool": "default"
      "domino/build-node": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default"
      "k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>
    preBootstrapCommands:
      - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
      - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
      - "jq -f /etc/docker/jq_script /etc/docker/daemon_backup.json | tee /etc/docker/daemon.json"
      - "systemctl restart docker"

  - name: domino-gpu-a
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 2
    volumeSize: 400
    availabilityZones: ["us-west-2a"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-gpu-b
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 2
    volumeSize: 400
    availabilityZones: ["us-west-2b"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

  - name: domino-gpu-c
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 2
    volumeSize: 400
    availabilityZones: ["us-west-2c"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
      "k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for autodiscovery <insert your cluster_name>

availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

For more information on autodiscovery see our Configuration Reference

3.4 Domino on GKE

Domino 4 can run on a Kubernetes cluster provided by the Google Kubernetes Engine (GKE).


3.4.1 Overview

When running on GKE, the Domino 4 architecture uses GCP resources to fulfill the Domino cluster requirements as
follows:
• Kubernetes control is managed by the GKE cluster
• Domino uses one node pool of three n1-standard-8 worker nodes to host the Domino platform
• Additional node pools host elastic compute for Domino executions with optional GPU accelerators
• Cloud Filestore is used to store user data, backups, logs, and Domino Datasets
• A Cloud Storage Bucket is used to store the Domino Docker Registry.
• The kubernetes.io/gce-pd provisioner is used to create persistent volumes for Domino executions.

3.4.2 Setting up a GKE cluster for Domino

This section describes how to configure a GKE cluster for use with Domino.

Namespaces

No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:

• platform: Durable Domino application, metadata, platform services required for platform operation
• compute: Ephemeral Domino execution pods launched by user actions in the application
• domino-system: Domino installation metadata and secrets

Node pools

The GKE cluster must have at least two node pools that produce worker nodes with the following specifications and
distinct node labels, and it may include an optional GPU pool:


Pool: platform
• Min-Max: 3-3
• Instance: n1-standard-8
• Disk: 128G
• Labels: dominodatalab.com/node-pool: platform

Pool: default
• Min-Max: 1-20
• Instance: n1-standard-8
• Disk: 400G
• Labels: dominodatalab.com/node-pool: default, domino/build-node: true

Pool: default-gpu (optional)
• Min-Max: 0-5
• Instance: n1-standard-8
• Disk: 400G
• Labels: dominodatalab.com/node-pool: default-gpu

If you want to configure the default-gpu pool, you must add a GPU accelerator to the node pool. Read the GKE
documentation on available accelerators and on deploying a DaemonSet that automatically installs the necessary drivers.
Additional node pools can be added with distinct dominodatalab.com/node-pool labels to make other in-
stance types available for Domino executions. Read Managing the Domino compute grid to learn how these different
node types are referenced by label from the Domino application.
Consult the Terraform snippets below for code representations of the required node pools.
Platform pool

resource "google_container_node_pool" "platform" {
  name     = "platform"
  location = $YOUR_CLUSTER_ZONE_OR_REGION
  cluster  = $YOUR_CLUSTER_NAME

  initial_node_count = 3
  autoscaling {
    max_node_count = 3
    min_node_count = 3
  }

  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"

    labels = {
      "dominodatalab.com/node-pool" = "platform"
    }

    disk_size_gb    = 128
    local_ssd_count = 1
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  timeouts {
    delete = "20m"
  }
}

Default compute pool


resource "google_container_node_pool" "compute" {
  name     = "compute"
  location = $YOUR_CLUSTER_ZONE_OR_REGION
  cluster  = $YOUR_CLUSTER_NAME

  initial_node_count = 1
  autoscaling {
    max_node_count = 20
    min_node_count = 1
  }

  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"

    labels = {
      "domino/build-node"            = "true"
      "dominodatalab.com/build-node" = "true"
      "dominodatalab.com/node-pool"  = "default"
    }

    disk_size_gb    = 400
    local_ssd_count = 1
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  timeouts {
    delete = "20m"
  }
}

Optional GPU pool

resource "google_container_node_pool" "gpu" {
  provider = google-beta
  name     = "gpu"
  location = $YOUR_CLUSTER_ZONE_OR_REGION
  cluster  = $YOUR_CLUSTER_NAME

  initial_node_count = 0

  autoscaling {
    max_node_count = 5
    min_node_count = 0
  }

  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"

    guest_accelerator {
      type  = "nvidia-tesla-p100"
      count = 1
    }

    labels = {
      "dominodatalab.com/node-pool" = "default-gpu"
    }

    disk_size_gb    = 400
    local_ssd_count = 1

    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  timeouts {
    delete = "20m"
  }
}

Network policy enforcement

Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. By
default, the network plugin in GKE will not enforce these policies. To run Domino securely on GKE, you must enable
enforcement of network policies.
Read the GKE documentation for instructions on enabling network policy enforcement for your cluster.

Dynamic block storage

The Domino installer will automatically create a storage class like the example below for use provisioning GCE
persistent disks as execution volumes. No manual setup is necessary for this storage class.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dominodisk
parameters:
  replication-type: none
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer


Shared storage

A Cloud Filestore instance must be provisioned with at least 10T of capacity and it must be configured to allow access
from the cluster. You will provide the IP address and mount path of this instance to the Domino installer, and it will
create an NFS storage class like the one below.

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    app.kubernetes.io/instance: nfs-client-provisioner
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: nfs-client-provisioner
    helm.sh/chart: nfs-client-provisioner-1.2.6-0.1.4
  name: domino-shared
parameters:
  archiveOnDelete: "false"
provisioner: cluster.local/nfs-client-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate

Docker registry storage

You will need one Cloud Storage Bucket accessible from your cluster to be used for storing the internal Domino
Docker Registry.

Domain

Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino. Once
Domino is deployed into your cluster, you must set up DNS for this name to point to an HTTPS Cloud Load Balancer
that has an SSL certificate for the chosen name, and forwards traffic to port 80 on your platform nodes.

Checking your GKE cluster

If you’ve applied the configurations described above to your GKE cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.


3.5 Domino on AKS

Domino 4 can run on a Kubernetes cluster provided by the Azure Kubernetes Service. When running on AKS, the
Domino 4 architecture uses Azure resources to fulfill the Domino cluster requirements as follows:

• For a complete Terraform module for Domino-compatible AKS provisioning, see terraform-azure-aks on
GitHub.
• Kubernetes control is handled by the AKS control plane with managed Kubernetes masters
• The AKS cluster’s default node pool is configured to host the Domino platform
• Additional AKS node pools provide compute nodes for user workloads
• An Azure storage account stores Domino blob data and datasets
• The kubernetes.io/azure-disk provisioner is used to create persistent volumes for Domino executions
• The Advanced Azure CNI is used for cluster networking, with network policy enforcement handled by Calico
• Ingress to the Domino application is handled by an SSL-terminating Application Gateway that points to a Ku-
bernetes load balancer
• Domino recommends provisioning with Terraform for extended control and customizability of all re-
sources. When setting up your Azure Terraform provider, please add a partner_id with a value of
31912fbf-f6dd-5176-bffb-0a01e8ac71f2 to enable usage attribution.


3.5.1 Setting up an AKS cluster for Domino

This section describes how to configure an AKS cluster for use with Domino.

Resource groups

You can provision the cluster, storage, and application gateway in an existing resource group. Note that when you
create the cluster, Azure creates a separate resource group that contains the cluster components themselves.

Namespaces

No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:

• platform: Durable Domino application, metadata, platform services required for platform operation
• compute: Ephemeral Domino execution pods launched by user actions in the application
• domino-system: Domino installation metadata and secrets

Node pools

The AKS cluster's initial default node pool can be sized and configured to host the Domino platform. The cluster
must have at least two node pools that produce worker nodes with the following specifications and distinct node
labels, and it may include an optional GPU pool:


Pool: platform
• Min-Max: 1-4
• VM: Standard_DS5_v2
• Disk: 128G
• Labels: dominodatalab.com/node-pool: platform

Pool: default
• Min-Max: 1-20
• VM: Standard_DS4_v2
• Disk: 128G
• Labels: dominodatalab.com/node-pool: default, domino/build-node: true

Pool: default-gpu (optional)
• Min-Max: 0-5
• VM: Standard_NC6
• Disk: 128G
• Labels: dominodatalab.com/node-pool: default-gpu, nvidia.com/gpu: true

The recommended architecture configures the cluster’s initial default node pool with the correct label and size to serve
as the platform node pool. See the below cluster Terraform resource for a complete example.

resource "azurerm_kubernetes_cluster" "aks" {
  name                       = "example_cluster"
  enable_pod_security_policy = false
  location                   = "East US"
  resource_group_name        = "example_resource_group"
  dns_prefix                 = "example_cluster"
  private_cluster_enabled    = false

  default_node_pool {
    enable_node_public_ip = false
    name                  = "platform"
    node_count            = 4
    node_labels           = { "dominodatalab.com/node-pool" : "platform" }
    vm_size               = "Standard_DS5_v2"
    availability_zones    = ["1", "2", "3"]
    max_pods              = 250
    os_disk_size_gb       = 128
    node_taints           = []
    enable_auto_scaling   = true
    min_count             = 1
    max_count             = 4
  }

  network_profile {
    load_balancer_sku  = "Standard"
    network_plugin     = "azure"
    network_policy     = "calico"
    dns_service_ip     = "100.97.0.10"
    docker_bridge_cidr = "172.17.0.1/16"
    service_cidr       = "100.97.0.0/16"
  }
}

A separate node pool for Domino default compute should be added after the cluster is created. Note that this is not
the initial cluster default node pool, but a separate node pool named default that is added to serve default Domino
compute. See the below node pool Terraform resource for a complete example.


resource "azurerm_kubernetes_cluster_node_pool" "aks" {

enable_node_public_ip = false
kubernetes_cluster_id = "example_cluster_id"
name = "default"
node_count = 1
vm_size = "Standard_DS4_v2"
availability_zones = ["1", "2", "3"]
max_pods = 250
os_disk_size_gb = 128
os_type = "Linux"
node_labels = {
"domino/build-node" = "true"
"dominodatalab.com/build-node" = "true"
"dominodatalab.com/node-pool" = "default"
}
node_taints = []
enable_auto_scaling = true
min_count = 1
max_count = 20

Additional node pools can be added with distinct dominodatalab.com/node-pool labels to make other in-
stance types available for Domino executions. Read Managing the Domino compute grid to learn how these different
node types are referenced by label from the Domino application. When adding GPU node pools, keep in mind the
Azure guidance and best practices on using GPU nodes in AKS.
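
As a hedged illustration, an optional GPU node pool matching the default-gpu specification above could be added with a Terraform resource similar to the following. The cluster ID, pool sizing, and taint are assumptions to adapt to your environment:

resource "azurerm_kubernetes_cluster_node_pool" "default_gpu" {
  # AKS pool names cannot contain hyphens, so the label (not the pool name) carries "default-gpu"
  kubernetes_cluster_id = "example_cluster_id"
  name                  = "defaultgpu"
  vm_size               = "Standard_NC6"
  os_disk_size_gb       = 128
  os_type               = "Linux"
  enable_auto_scaling   = true
  # Scale-to-zero assumed; raise min_count if your provider version requires >= 1
  min_count             = 0
  max_count             = 5
  node_labels = {
    "dominodatalab.com/node-pool" = "default-gpu"
    "nvidia.com/gpu"              = "true"
  }
  # Assumed taint convention to keep non-GPU workloads off these more expensive nodes
  node_taints = ["nvidia.com/gpu=true:NoSchedule"]
}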

Network plugin

The Domino-hosting cluster should use the Advanced Azure CNI with network policy enforcement by Calico. See the
below network_profile configuration example.

network_profile {
load_balancer_sku = "Standard"
network_plugin = "azure"
network_policy = "calico"
dns_service_ip = "100.97.0.10"
docker_bridge_cidr = "172.17.0.1/16"
service_cidr = "100.97.0.0/16"
}

Dynamic block storage

AKS clusters come equipped with several kubernetes.io/azure-disk backed storage classes by default.
Domino requires use of premium disks for adequate input and output performance. The managed-premium class
that is created by default can be used. Consult the following storage class specification as an example.


allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
labels:
kubernetes.io/cluster-service: "true"
name: managed-premium
selfLink: /apis/storage.k8s.io/v1/storageclasses/managed-premium
parameters:
cachingmode: ReadOnly
kind: Managed
storageaccounttype: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate

Persistent blob and data storage

Domino uses one Azure storage account for both blob data and files. See the below configuration for the two resources
required, the storage account itself and a blob container inside the account.

resource "azurerm_storage_account" "domino" {


name = "example_storage_account"
resource_group_name = "example_resource_group"
location = "East US"
account_kind = "StorageV2"
account_tier = "Standard"
account_replication_type = "LRS"
access_tier = "Hot"
}

resource "azurerm_storage_container" "domino_registry" {


name = "docker"
storage_account_name = "example_storage_account"
container_access_type = "private"
}

Record the names of these resources for use when installing Domino.
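
You will also need an access key for the storage account when filling in the installer's azure_blobs_override section. As a sketch, assuming the Azure CLI and the example names used above, the key can be retrieved with:

# Look up the primary access key for the Domino storage account
az storage account keys list \
    --resource-group example_resource_group \
    --account-name example_storage_account \
    --query "[0].value" --output tsv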

Domain

Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino.


Checking your AKS cluster

If you’ve applied the configurations described above to your AKS cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.
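
Before running the checker, it can be useful to confirm that the expected node pools, labels, and storage class are present; for example, assuming kubectl is pointed at the AKS cluster:

# List nodes with the Domino node pool and build-node labels shown as columns
kubectl get nodes -L dominodatalab.com/node-pool -L domino/build-node

# Confirm the premium block storage class exists
kubectl get storageclass managed-premium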

3.5.2 Example installer configuration

See below for an example configuration file for the Domino installer based on the provisioning examples above.
schema: '1.0'
name: domino-deployment
version: 4.1.9
hostname: domino.example.org
pod_cidr: '100.97.0.0/16'
ssl_enabled: true
ssl_redirect: true
request_resources: true
enable_network_policies: true
enable_pod_security_policies: true
create_restricted_pod_security_policy: true
namespaces:
platform:
name: domino-platform
annotations: {}
labels:
domino-platform: 'true'
compute:
name: domino-compute
annotations: {}
labels: {}
system:
name: domino-system
annotations: {}
labels: {}
ingress_controller:
create: true
gke_cluster_uuid: ''
storage_classes:
block:
create: false
name: managed-premium
type: azure-disk
access_modes:
- ReadWriteOnce
base_path: ''
default: false
shared:
create: true
name: dominoshared
type: azure-file
access_modes:
- ReadWriteMany
efs:
region: ''
filesystem_id: ''
nfs:
server: ''
mount_path: ''
mount_options: []
azure_file:
storage_account: 'example_storage_account'
blob_storage:
projects:
type: shared
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
logs:
type: shared
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
backups:
type: shared
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
default:
type: shared
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
enabled: true
autoscaler:
enabled: false
cloud_provider: azure
groups:
- name: ''
min_size: 0
max_size: 0
aws:
region: ''
azure:
resource_group: ''
subscription_id: ''
spotinst_controller:
enabled: false
token: ''
account: ''
external_dns:
enabled: false
provider: aws
domain_filters: []
zone_id_filters: []
git:
storage_class: managed-premium
email_notifications:
enabled: false
server: smtp.customer.org
port: 465
encryption: ssl
from_address: domino@customer.org
authentication:
username: ''
password: ''
monitoring:
prometheus_metrics: true
newrelic:
apm: false
infrastructure: false
license_key: ''
helm:
tiller_image: gcr.io/kubernetes-helm/tiller
appr_registry: quay.io
appr_insecure: false
appr_username: '$QUAY_USERNAME'
appr_password: '$QUAY_PASSWORD'
private_docker_registry:
server: quay.io
username: '$QUAY_USERNAME'
password: '$QUAY_PASSWORD'
internal_docker_registry:
s3_override:
region: ''
bucket: ''
sse_kms_key_id: ''
gcs_override:
bucket: ''
service_account_name: ''
project_name: ''
azure_blobs_override:
account_name: 'example_storage_account'
account_key: 'example_storage_account_key'
container: 'docker'
telemetry:
intercom:
enabled: false
mixpanel:
enabled: false
token: ''
gpu:
enabled: false
fleetcommand:
enabled: false
api_token: ''
teleport:
acm_arn: arn:aws:acm:<region>:<account>:certificate/<id>
enabled: false
hostname: teleport-domino.example.org

3.6 Domino on OpenShift

Starting with Domino 4.3.1, the Domino platform can run on the OpenShift Container Platform (OCP) and OpenShift Kubernetes Engine (OKE). Domino supports OCP/OKE versions 4.4+.

3.6.1 Setting up an OpenShift cluster for Domino

This section describes how to configure an OpenShift Kubernetes Engine cluster for use with Domino.


Namespaces

No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:

Namespace Contains
platform Durable Domino application, metadata, platform services required for platform operation
compute Ephemeral Domino execution pods launched by user actions in the application
domino-system Domino installation metadata and secrets

Node pools

The OpenShift cluster must have worker nodes with the following specifications and distinct node labels, and it may
include an optional GPU pool:

• platform: min 3 nodes, 8 vCPU, 32G memory, 128G disk. Labels: dominodatalab.com/node-pool: platform
• default: 1-20 nodes, 8 vCPU, 32G memory, 400G disk. Labels: dominodatalab.com/node-pool: default, domino/build-node: true
• default-gpu (optional): 0-5 nodes, 8 vCPU, 32G memory, 400G disk. Labels: dominodatalab.com/node-pool: default-gpu, nvidia.com/gpu: true

More generally, the platform worker nodes need an aggregate minimum of 24 CPUs and 96G of memory. Spreading
the resources across multiple nodes with proper failure isolation (e.g. availability zones) is recommended.
Managing nodes and node pools in OpenShift is done through Machine Management and the Machine API. For each node pool above, you will need to create a MachineSet. Be sure to provide the Domino-required labels in the Machine spec (spec.template.spec.metadata.labels stanza). Also, update the provider spec for your infrastructure provider of choice and sizing (spec.template.spec.providerSpec stanza); for example, in AWS, updates may include, but are not limited to, the AMI ID, block device storage sizing, and availability zone placement.
The following is an example MachineSet for the platform node pool:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
labels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
name: firestorm-dxcpd-platform-us-west-1a
namespace: openshift-machine-api
spec:
replicas: 3
selector:
matchLabels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-platform-us-west-1a
template:
metadata:
labels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
machine.openshift.io/cluster-api-machine-role: platform
machine.openshift.io/cluster-api-machine-type: platform
machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-platform-us-west-1a
spec:
metadata:
labels:
node-role.kubernetes.io/default: ""
dominodatalab.com/node-pool: platform
providerSpec:
value:
ami:
id: ami-02b6556210798d665
apiVersion: awsproviderconfig.openshift.io/v1beta1
blockDevices:
- ebs:
iops: 0
volumeSize: 120
volumeType: gp2
credentialsSecret:
name: aws-cloud-credentials
deviceIndex: 0
iamInstanceProfile:
id: firestorm-dxcpd-worker-profile
instanceType: m5.2xlarge
kind: AWSMachineProviderConfig
metadata:
creationTimestamp: null
placement:
availabilityZone: us-west-1a
region: us-west-1
publicIp: null
securityGroups:
- filters:
- name: tag:Name
values:
- firestorm-dxcpd-worker-sg
subnet:
filters:
- name: tag:Name
values:
- firestorm-dxcpd-private-us-west-1a
tags:
- name: kubernetes.io/cluster/firestorm-dxcpd
value: owned
userDataSecret:
name: worker-user-data

The following is an example MachineSet for the default (compute) node pool:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
labels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
name: firestorm-dxcpd-default-us-west-1a
namespace: openshift-machine-api
spec:
replicas: 3
selector:
matchLabels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-default-us-west-1a
template:
metadata:
labels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
machine.openshift.io/cluster-api-machine-role: default
machine.openshift.io/cluster-api-machine-type: default
machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-default-us-west-1a
spec:
metadata:
labels:
node-role.kubernetes.io/default: ""
dominodatalab.com/node-pool: default
domino/build-node: "true"
providerSpec:
value:
ami:
id: ami-02b6556210798d665
apiVersion: awsproviderconfig.openshift.io/v1beta1
blockDevices:
- ebs:
iops: 0
volumeSize: 400
volumeType: gp2
credentialsSecret:
name: aws-cloud-credentials
deviceIndex: 0
iamInstanceProfile:
id: firestorm-dxcpd-worker-profile
instanceType: m5.2xlarge
kind: AWSMachineProviderConfig
metadata:
creationTimestamp: null
placement:
availabilityZone: us-west-1a
region: us-west-1
publicIp: null
securityGroups:
- filters:
- name: tag:Name
values:
- firestorm-dxcpd-worker-sg
subnet:
filters:
- name: tag:Name
values:
- firestorm-dxcpd-private-us-west-1a
tags:
- name: kubernetes.io/cluster/firestorm-dxcpd
value: owned
userDataSecret:
name: worker-user-data
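
After applying the MachineSet manifests, you can verify that machines are provisioning and that nodes join the cluster with the expected labels. A minimal check, assuming the oc CLI and the manifest filenames shown here, might look like:

# Apply the platform and default MachineSets (filenames are assumptions)
oc apply -f platform-machineset.yaml
oc apply -f default-machineset.yaml

# Watch machines come up and confirm the Domino node labels
oc get machinesets -n openshift-machine-api
oc get nodes -L dominodatalab.com/node-pool -L domino/build-node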

Node Autoscaling

For clusters on top of an elastic cloud provider, node autoscaling (or Machine autoscaling) is achieved by creating ClusterAutoscaler and MachineAutoscaler resources.
The following is an example ClusterAutoscaler:

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
name: "default"
spec:
podPriorityThreshold: -10
resourceLimits:
maxNodesTotal: 20
cores:
min: 8
max: 256
memory:
min: 4
max: 256
gpus:
- type: nvidia.com/gpu
min: 0
max: 16
- type: amd.com/gpu
min: 0
max: 4
scaleDown:
enabled: true
delayAfterAdd: 10m
delayAfterDelete: 5m
delayAfterFailure: 30s
unneededTime: 10m

The following is an example MachineAutoscaler for the MachineSet created for the default node pool:

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
name: "firestorm-dxcpd-default-us-west-1a"
namespace: "openshift-machine-api"
spec:
minReplicas: 1
maxReplicas: 5
scaleTargetRef:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
name: firestorm-dxcpd-default-us-west-1a
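
Both autoscaling resources can be applied and inspected with the oc CLI; the filenames below are assumptions:

# Create the autoscaler resources
oc apply -f clusterautoscaler.yaml
oc apply -f machineautoscaler.yaml

# Confirm they were created
oc get clusterautoscaler default
oc get machineautoscaler -n openshift-machine-api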

Storage

See the Storage requirements for your infrastructure.

Networking

Domain

Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name.

Network Plugin

Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. By default, OpenShift uses the Cluster Network Operator to deploy the OpenShift SDN default CNI network provider plugin, which supports network policies and therefore should work without additional configuration.

Ingress

Domino uses the NGINX ingress controller maintained by the Kubernetes project alongside (but without replacing) the HAProxy-based ingress controller implemented by OpenShift, and deploys the ingress controller as a NodePort service. By default, the ingress listens on node ports 443 (HTTPS) and 80 (HTTP).
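
You can confirm the assigned node ports after installation; the namespace and service naming below are assumptions based on a typical Domino install:

# Show the NodePort service created for the Domino ingress controller
oc get svc -n domino-platform | grep -i ingress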

Load Balancer

A load balancer should be set up to use your DNS name. For example, in AWS, you will need to set up DNS so that it points a CNAME at an Elastic Load Balancer.
After you complete the installation process, you must configure the load balancer to balance across the platform nodes
at the ports specified by your ingress.


External Resources

If you plan to connect your cluster to other resources like data sources or authentication services, pods running on the
cluster should have network connectivity to those resources.

Container Registry

Domino deploys its own container image registry instead of using the OpenShift built-in container image registry. During installation, the OpenShift cluster image configuration is modified to trust the Domino certificate authority (CA). This is done to ensure that OpenShift can run pods using Domino's custom-built images. In the images.config.openshift.io/cluster resource, you can find a reference to a ConfigMap that contains the Domino CA.

spec:
additionalTrustedCA:
name: domino-deployment-registry-config
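
You can inspect this configuration on your cluster to confirm the Domino CA is trusted, for example:

# Show the additional trusted CA configured in the cluster image settings
oc get images.config.openshift.io/cluster -o yaml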

Checking your OpenShift cluster

If you’ve applied the configurations described above to your OpenShift cluster, it should be able to run the Domino
cluster requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed
in the cluster.

3.7 NVIDIA DGX in Domino

NVIDIA DGX systems can run Domino workloads if they are added to your Kubernetes cluster as compute (worker) nodes. Read below to learn how to set up and add DGX systems to Domino.


The flow chart begins from the top left, with a Domino end user requesting a GPU tier.
If a DGX is already configured for use in Domino’s Compute Grid, the Domino platform administrator can define a
GPU-enabled Hardware Tier from within the Admin console.
The middle lane of the flow chart outlines the steps required to integrate a provisioned DGX system as a node in the
Kubernetes cluster that is hosting Domino, and subsequently configure that node as a GPU-enabled component of
Domino’s compute grid.
The bottom swim lane outlines that, to leverage a Nvidia DGX system with Domino, it must be purchased and provi-
sioned into the target infrastructure stack hosting Domino.

3.7.1 Preparing & Installing DGX System(s)

Nvidia DGX systems can be purchased through Nvidia's Partner Network. Install the DGX system in a hosting environment with network access to the additional host and storage infrastructure required to host Domino.


3.7.2 Configure DGX System for Domino

Option A: New Kubernetes Cluster & Domino Install

If this is a new (greenfield) deployment of Domino, you must first install and configure a Kubernetes cluster that meets Domino's Cluster Requirements, including valid configuration of your Kubernetes network policies to support secure communication between the pods that will host Domino's platform services and compute grid.

Option B: Existing Kubernetes Cluster and/or Domino Installation

Adding a DGX to an existing Domino deployment is as simple as adding the DGX to your Kubernetes cluster as a worker node, with a node label consistent with your chosen naming conventions. The default node label for GPU-based worker nodes is 'default-gpu'.
Additionally, proper taints must be added to your DGX node. This facilitates the selection of the DGX for GPU-based
workloads running on Domino.
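
As a sketch, once the DGX has joined the cluster as a worker node, the label and a GPU taint could be applied with kubectl. The node name placeholder and the taint key/effect are assumptions to align with your own conventions:

# Label the DGX so Domino's GPU hardware tier can target it
kubectl label node <dgx-node-name> dominodatalab.com/node-pool=default-gpu

# Taint the node so only GPU workloads are scheduled onto it (assumed convention)
kubectl taint node <dgx-node-name> nvidia.com/gpu=true:NoSchedule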

Configuring a Domino Hardware Tier to leverage your configured DGX Compute Node

Now that the DGX is added to your cluster and labeled properly, we can move on to configuring Domino Hardware Tiers from within Domino's Admin UI.
Domino provides governance features from within this interface, supporting LDAP/AD federation or SSO-based at-
tributes for managed access control and user execution quotas. We have also published a series of best practices for
managing hardware tiers in your compute grid.

CUDA / NVIDIA driver configuration

Nvidia Driver
Configuration of the Nvidia driver at the host level should be performed by your Server administrator. The correct
Nvidia driver for your host can be identified by using the configuration guide found here. More information can be
found in the DGX Systems Documentation.
CUDA Version
The CUDA software version required for a given development framework, such as Tensorflow, will be documented
on their website. For example, Tensorflow >=2.1 requires CUDA 10.1 and some additional software packages, e.g.,
CuDNN.
CUDA & Nvidia Driver Compatibility
Once the correct CUDA version is identified for your specific needs, one must consult the CUDA-Nvidia Driver
Compatibility Table.
In the Tensorflow 2.1 example, the CUDA 10.1 requirement means one must be running CUDA >=10.1 and Nvidia
driver >=410.48 on the host. Table 1 in the link above will guide your choice of matching CUDA & Nvidia driver
versions.
Subsequently, the Domino Compute Environment must be configured to leverage the exact CUDA version that corre-
sponds to the desired application.
Simplifying this constraint, note that CUDA drivers provide backwards compatibility: the CUDA version on the host
can be greater or equal to that which is specified in your Compute Environment.
And because the CUDA software installation process often returns unexpected results when attempting to install an exact CUDA version (including patch version), the fastest route to a functioning configuration is typically to install the latest available minor release of your required major CUDA version, and then to create a Docker environment variable (ENV) within your Compute Environment that constrains the compatible set of CUDA versions, GPU generations, and Nvidia drivers. An illustrative sketch follows.
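
For example, such a constraint can be expressed as a Dockerfile environment variable in the Compute Environment. The variable name and version string below follow the NVIDIA container runtime convention and are assumptions, not a Domino-specific requirement:

# Constrain acceptable CUDA and driver combinations for this environment (Tensorflow 2.1 example)
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 driver>=418.39"
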
Need Additional Assistance?
Please consult your Domino customer success engineer for guidance on your specific needs. Domino can provide sample configurations that will simplify your configuration process.

3.7.3 Best Practices

1. Build Node
We recommend you do not use a DGX GPU as a build node for environments. Instead, opt for a CPU resource
as part of your overall Domino architecture.
2. Splitting GPUs per Tier
We recommend providing several GPU tiers with different numbers of GPUs in each tier (e.g. 1, 2, 4, and 8 GPU hardware tiers), as different training jobs can make use of a single GPU or multiple GPUs in parallel, and consuming a whole DGX box for one workload may not be feasible in your environment.
3. Governance
After splitting up hardware tiers, access can be global or, alternatively, limited to specific organizations. We recommend ensuring that the right organizations have GPU Hardware Tier access (or are restricted from it) to ensure availability for critical work and to prevent unauthorized use of GPU tiers.

3.8 Domino in Multi-Tenant Kubernetes Cluster

3.8.1 What is Multi-Tenancy?

In the context of Kubernetes and Domino, multi-tenancy means a Kubernetes cluster (hereinafter simply referred to as "cluster" unless otherwise disambiguated) that supports multiple applications and is not dedicated just to Domino (i.e. each application is an individual cluster tenant). Domino supports multi-tenant clusters (or multi-tenancy) by adhering to a set of principles that ensure it does not interfere with other applications or cluster-wide services that may exist. The same principles apply when installing Domino into a multi-tenant cluster, assuming typical best-practice multi-tenancy constraints.

3.8.2 Multi-Tenancy Use Cases

• On-Premise and Capacity Constrained Environments. In this case, you are trying to maximize the utilization
of limited, often physical, infrastructure.
• Minimize Administration Costs.


3.8.3 Multi-Tenancy Risks

• Shared Resource Loading. Multi-tenant clusters still share common resources, such as the Kubernetes control plane (e.g. the API server), DNS, and ingress. As a result, other applications can affect Domino's performance and vice versa.
• Imperfect Compute Isolation and Predictability. Unless you restrict node-level usage for applications, there is no isolation at the node level. Hence, Domino Runs will potentially share compute with other applications. Ill-behaved tenants could impact Domino Runs by hogging resources, reducing the resources available to Domino or, in the worst case, bringing down the node. In most cases this will probably not happen; however, if particular Domino Runs need predictability or strict isolation, this may be an issue. You can reserve nodes just for the Domino application in your cluster, but doing so weakens the argument for multi-tenancy.
• Increased Security Complexity and Risk. Cluster administrators will likely have to manage a larger, or finer-grained, set of RBAC objects and rules. Shared resources and node-level coupling expose an additional attack surface to any malicious tenants.
• Shared Cluster Maintenance. Any cluster maintenance will cause all applications to be subject to the same maintenance window. Hence, if the cluster maintenance is due to a particular application, all applications will be subjected to the same downtime even though they do not require that maintenance.

Note: Given the risks and data science workload profile, we highly recommend that where possible Domino be
deployed in its own Kubernetes cluster for enterprise and production scenarios.

3.8.4 Known Considerations

Files

If two or more applications attempt to map a file from the "host path" and read or modify that file, problems can arise. The use of host paths is discouraged except for monitoring software, and currently the only place where Domino requires a host mount is for fluentd to monitor container logs. As this is standard practice for fluentd and an explicitly read-only operation, Domino will not interfere with other applications.

System Settings

Applications that require system settings be modified for performance or reliability can interfere with or overwrite
other applications’ settings.

Elasticsearch

Currently, the only service that requires an updated system setting for Domino is Elasticsearch, and this requirement can be disabled if the cluster operators already have an acceptable setting in place. vm.max_map_count needs to be set for Elasticsearch to work; this is not a Domino requirement, but a mandatory requirement from the upstream Elasticsearch Helm chart.
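
As a sketch of what cluster operators typically set (the value below is the common Elasticsearch recommendation, not a Domino-specific number):

# Set on each node that may host Elasticsearch pods; persist via /etc/sysctl.d as appropriate
sysctl -w vm.max_map_count=262144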

GPU Support

We deploy a number of services in order to properly expose GPUs for Domino. In a multi-tenant environment,
we would generally ask cluster administrators to manage these themselves, and we can disable our services via our
installer.


DaemonSets

We currently deploy four DaemonSets for a standard install.


1. docker-registry Certificate Management. This allows the underlying Docker daemon to pull from the
Domino deployed Docker registry, which backs Domino Compute Environments. The service mounts the under-
lying /etc/docker/certs.d directory and creates additional files to support the Domino Docker registry.
This is not something that will necessarily interfere with other applications, but it may cause concern for cluster operators, and any host-level operation is inherently risky.
2. image-cache-agent. This handles look-ahead caching and image management for the cluster Docker
daemon, allowing for shorter Domino execution start-up times. This should not be deployed on non-Domino
nodes.
3. fluentd. This monitors logs from users' compute containers, which are pushed through a pipeline that feeds the Jobs and Workspaces dashboard. See Files.
4. prometheus-node-exporter. This monitors node metrics, such as network statistics, and it is
polled by the Domino deployed Prometheus server. This can be disabled with the monitoring.
prometheus_metrics flag.
As of Domino 4.2, all DaemonSets can be limited by a nodeSelector flag, which will cause the pods to be scheduled only on the subset of nodes that carry a specific label. Depending on the cluster operator's needs, we will require a categorical label on the nodes intended for Domino's use that we can target for deployment.
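
For example, operators could mark the nodes intended for Domino with a single label and pass that label to the installer's node selector configuration; the label key and value here are assumptions:

# Label every node that Domino DaemonSets are allowed to run on
kubectl label node <node-name> domino-owned=true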

Non-Namespaced Resources

ClusterRoles

Domino creates separate namespaces for its services and requires communication between these namespaces. Domino creates a number of ClusterRoles and bindings that control access to its namespaces or to global resources. As of Domino 4.2, all Domino-created ClusterRoles are prefixed by the deployment name, which is specified by the name key in the domino.yml configuration file (see Configuration Reference).

Pod Security Policies

By default, Domino uses pod security policies (PSPs) to ensure that pods cannot use system-level permissions that they have not been granted. Because PSPs are cluster-scoped rather than namespaced, they too are prefixed with the deployment name. Applications cannot use these PSPs without explicitly being granted access through a Role or ClusterRole.

Custom Resource Definitions

Domino does not make extensive use of Custom Resource Definitions (CRDs) except for the on-demand spark feature
in 4.x. Our CRD is named uniquely, sparkclusters.apps.dominodatalab.com and should not interfere
with other applications.

Persistent Volumes

Domino uses persistent volumes extensively throughout the system to ensure that data storage is abstracted and permanent. With the exception of two shared storage mounts, which both incorporate namespaces to ensure uniqueness, we strictly use dynamic volume creation through persistent volume claims, which dynamically allocate names that will not conflict with any other application's.


3.8.5 Recommendations

• Separate Node Pool for Platform and Compute. Even if Domino is installed in a multi-tenant cluster, we
prefer to have a separate node pool for our Platform and Compute Nodes. This is not always possible, but it’s a
decent compromise. Domino does set resource limits and requests so that it cannot overwhelm individual nodes.

3.9 Encryption in transit

Intra-cluster encryption in transit is implemented via a deployed service mesh, specifically Istio. At installation time,
Domino can deploy Istio for Domino use only, or Domino can be configured to leverage an existing deployed Istio
on the Kubernetes cluster (potentially shared with other applications). See Installation Configuration Reference for
details.

3.9.1 Custom certificate authority certificates

Attention: This is only applicable for a Domino deployed Istio

Out of the box, Istio provides scalable identity and X.509 certificate management for use with mTLS encryption,
including periodic certificate and key rotation. Because all encrypted communication is internal, these certificates are
not exposed or required for communication to any external services, such as web browsers and clients.
We do understand that certain enterprise policies mandate the use of corporate public key infrastructure (PKI) and
necessitate the use of certificate authority (CA) certificates.

Setting up custom CA certificates

Note: All certificates must be X.509 PEM format and keys must be passwordless.

• root-cert.pem: Root CA certificate for PKI.
• ca-cert.pem: Intermediate CA certificate from root CA. This is the Istio CA certificate.
• ca-key.pem: Private key for Istio CA certificate.
• cert-chain.pem: Full chain from ca-cert.pem to root-cert.pem (including both certificates).

Assuming N intermediate certificates denoted as int-ca-<i>.pem, with i = {1,...,N}.


# Concatenate all certificates


cat ca-cert.pem int-ca-1.pem ... int-ca-N.pem root-cert.pem > cert-chain.pem

# Create new secret with CA cert files


kubectl -n istio-system create secret generic cacerts \
--from-file=./ca-cert.pem \
--from-file=./ca-key.pem \
--from-file=./root-cert.pem \
--from-file=./cert-chain.pem

New Domino installation

A standard installation following the install process with the fleetcommand-agent (Domino installer) will auto-
matically pick up the created Secret and Istio will use the custom CA certificates.

Existing Domino installation

For an existing Domino installation, create the Secret as described above and then restart all of the existing installation's pods so that Istio picks up the custom CA certificates (see the full restart procedure at the end of this section).

Updating existing custom CA certificates

This section describes how to update the custom CA certificate used by Istio for intra-cluster encryption in transit.
There are two scenarios:
1. No changes to the private key and common name. This assumes only ca-cert.pem is updated.
2. Updates to the private key, common name, or upstream certificates. Any of the certificate files have changed, including any upstream intermediate certificates.
In both cases, you need to create a new full chain certificate file (cert-chain.pem).

Tip: We recommend backing up existing certificates and keys before updating new ones.

No changes to private key and common name

The procedure to update the custom CA certificates is to create a Secret with the new files and restart the Istio daemon (istiod).
# Delete existing secret with CA cert files
kubectl -n istio-system delete secret cacerts

# Create new secret with CA cert files


kubectl -n istio-system create secret generic cacerts \
--from-file=./ca-cert.pem \
--from-file=./ca-key.pem \
--from-file=./root-cert.pem \
--from-file=./cert-chain.pem

# Restarting all istiod pods
kubectl -n istio-system delete po -l app=istiod

Updated private key, common name, or upstream certificates

If changes have been made or are needed to the private key, common name (CN) or upstream certificates, a full restart
is required in addition to creating a new Secret with the new files an restarting the Istio daemon in the previous section.

# Delete existing secret with CA cert files


kubectl -n istio-system delete secret cacerts

# Create new secret with CA cert files


kubectl -n istio-system create secret generic cacerts \
--from-file=./ca-cert.pem \
--from-file=./ca-key.pem \
--from-file=./root-cert.pem \
--from-file=./cert-chain.pem

# Full restart for all Istio pods


for NS in istio-system domino-platform domino-compute; \
do \
    kubectl -n $NS get po --no-headers -o custom-columns=name:metadata.name | xargs kubectl -n $NS delete po; \
done

There are two types of Kubernetes node used by Domino:


• Platform nodes
Platform nodes, labeled with dominodatalab.com/platform-node: true, host the always-on com-
ponents of the Domino application, including the frontends, API server, authentication service, and supporting
metadata services. These nodes host a fixed collection of persistent pods.
• Compute nodes
Compute nodes, labeled with dominodatalab.com/node-pool: <node-name>, host user jobs and
published Domino Models and Apps. The workload hosted by these nodes will change with user demand, and
using an elastic cloud cluster will allow for automatic scaling of this pool to meet the needs of active users.
Read the Architecture overview to learn more.

3.10 Compatibility

Domino has been tested and verified to run on the following types of clusters:

• Amazon Elastic Kubernetes Service
• Azure Kubernetes Service
• Google Kubernetes Engine
• Tanzu Kubernetes Grid (aka Pivotal Container Service)
• Red Hat OpenShift
• Rancher

If you have a cluster from another provider, you can check for compatibility by running the Domino cluster require-
ments checker. If you have questions about cluster compatibility, contact Domino.

CHAPTER 4

Installation

4.1 Installation process

The Domino platform runs on Kubernetes. To simplify deployment and configuration of Domino services, Domino
provides an install automation tool called the fleetcommand-agent that uses Helm to deploy Domino into your
compatible cluster. The fleetcommand-agent is a Python application delivered in a Docker container, and can
be run locally or as a job inside the target cluster.


4.1.1 Requirements

The install automation tools are delivered as a Docker image, and need to run on an installation workstation that meets
the following requirements:
• Docker installed
• Kubectl service account access to the cluster
• Access to download and install Helm via package manager or GitHub
• Access to quay.io to download the installer image
Additionally, you will need credentials for an installation service account that can access the Domino upstream image
repositories in quay.io. Throughout these instructions, these credentials will be referred to as $QUAY_USERNAME and
$QUAY_PASSWORD. Contact your Domino account team if you need new credentials.
The fleetcommand-agent needs access to two types of assets to install Domino:
1. Docker images for Domino components
2. Helm charts
The hosting cluster will need access to the following domains via Internet to retrieve component and dependency
images for online installation:
• quay.io
• domino.tech
• k8s.gcr.io
• docker.elastic.co
• docker.io
• gcr.io
Alternatively, you can configure the fleetcommand-agent to point to a private docker registry and application
registry for offline installation.

4.1.2 Pulling the fleetcommand-agent image

1. Log in to quay.io with the credentials described in the requirements section above.

docker login quay.io

2. Find the image URI for the version of the fleetcommand-agent you want to use from the release notes.
3. Pull the image to your local machine.


docker pull quay.io/domino/fleetcommand-agent:v34

4.1.3 Running fleetcommand-agent commands

The default entrypoint for the fleetcommand-agent is:

"Entrypoint": [
"python",
"-m",
"fleetcommand_agent"
]

This launches the Python application inside the container at /app/fleetcommand_agent. This allows you to
easily run agent commands via docker run like this:

docker run --rm quay.io/domino/fleetcommand-agent:v34 $COMMAND $ARGUMENTS

The fleetcommand-agent supports the following commands:

init

Generates a template configuration file.


Arguments:
• --file -f
File system path to write the template to. This should be a host volume mounted to the container to persist the
output.
• --full -F
Includes optional and advanced portions of the template. Should only be used when advanced options are
needed, as configurations with these fields are more complex to maintain.
• --version
Domino version to generate a configuration template for.


• --image-registry
Provide a registry URI to prepend to Domino images to set up the template for installation from a private Docker
registry. Should be used in conjunction with --full.
Example:

docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 init --file /install/domino.yml

run

Installs Domino into a cluster specified by a Kubernetes configuration from the KUBECONFIG environment variable.
A valid configuration file must be passed in to this command.
Arguments:
• --file -f
File system path to the complete and valid configuration file.
• --kubeconfig
Path to Kubernetes configuration file containing cluster and authentication information to use.
• --dry
Use this mode to not make any permanent changes to the target cluster. A dry run checks service account
permissions and generates detailed logs about the charts to be deployed with the given configuration. The
output is written to /app/logs and /app/.appr_chart_cache inside the container.
Note that this option requires that the namespaces you want to use already exist, and for Helm 2 there must be
an accessible Tiller.
Example:

docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 run --file /install/domino.yml

Example dry run:

docker run --rm -v $(pwd):/install -v $(pwd)/logs:/app/logs -v $(pwd)/cache:/app/.appr_chart_cache quay.io/domino/fleetcommand-agent:v34 run --dry --file /install/domino.yml

destroy

Removes all resources from the target cluster for a given configuration file.
Arguments:


• --file -f
File system path to the complete and valid configuration file.
• --kubeconfig
Path to Kubernetes configuration file containing cluster and authentication information to use.
• --dry
Use this mode to not make any permanent changes to the target cluster. A dry run checks service account
permissions and generates detailed logs about the charts to be deployed with the given configuration.
Example:

docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 destroy --file /install/domino.yml

4.1.4 Install process

1. Connect to a workstation that meets the install automation requirements listed above.
2. Log in to quay.io with the credentials described in the requirements section above.

docker login quay.io

3. Retrieve the Domino installer image from quay.io.

docker pull quay.io/domino/fleetcommand-agent:v34

4. Initialize the installer application to generate a template configuration file named domino.yml.

docker run --rm -it \


-v $(pwd):/install \
quay.io/domino/fleetcommand-agent:v34 \
init --file /install/domino.yml

5. Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting do-
main. Read the configuration reference for more information about available keys, and consult the configuration
examples for guidance on getting started.
Note that you should change the value of name from domino-deployment to something that identifies the
purpose of your installation and contains the name of your organization.
6. Run this install script from the directory with the finalized configuration file to install Domino into the cluster.
Note that you must fill in your $QUAY_USERNAME and $QUAY_PASSWORD where indicated, and also note that
this script assumes your installer configuration file is in the same directory, and is named exactly domino.yml.


#!/bin/bash

set -ex

kubectl delete po --ignore-not-found=true fleetcommand-agent-install

kubectl create secret \


docker-registry \
-o yaml --dry-run \
--docker-server=quay.io \
--docker-username=$QUAY_USERNAME \
--docker-password=$QUAY_PASSWORD \
--docker-email=. domino-quay-repos | kubectl apply -f -

kubectl create configmap \


fleetcommand-agent-config \
-o yaml --dry-run \
--from-file=domino.yml | kubectl apply -f -

cat <<EOF | kubectl apply -f -


apiVersion: v1
kind: ServiceAccount
metadata:
name: admin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: admin-default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
name: admin
namespace: default
---
apiVersion: v1
kind: Pod
metadata:
name: fleetcommand-agent-install
spec:
serviceAccountName: admin
imagePullSecrets:
- name: domino-quay-repos
restartPolicy: Never
containers:
- name: fleetcommand-agent
image: quay.io/domino/fleetcommand-agent:v34
args: ["run", "-f", "/app/install/domino.yml", "-v"]
imagePullPolicy: Always
volumeMounts:
- name: install-config
mountPath: /app/install/
volumes:
- name: install-config
configMap:
name: fleetcommand-agent-config
EOF

set +e
while true; do
sleep 5
if kubectl logs -f fleetcommand-agent-install; then
break
fi
done

7. The installation process can take up to 30 minutes to fully complete. The installer will output verbose logs and
surface any errors it encounters, but it can also be useful to follow along in another terminal tab by running:

kubectl get pods --all-namespaces

This will show the status of all pods being created by the installation process. If you see any pods enter a crash
loop or hang in a non-ready state, you can get logs from that pod by running:

kubectl logs $POD_NAME --namespace $NAMESPACE_NAME

If the installation completes successfully, you should see a message that says:

2019-11-26 21:20:20,214 - INFO - fleetcommand_agent.Application - Deployment complete.

Domino is accessible at $YOUR_FQDN

However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for
the name to point to an ingress load balancer with the appropriate SSL certificate that forwards traffic to your
platform nodes.

4.1.5 Upgrading

Upgrading a Domino deployment is a simple process of running the installer again with the same configuration, but with the version field set to the value of the desired upgrade version. See the installer configuration reference and the installer release notes for information on the Domino versions your installer can support.
If you need to upgrade to a newer installer version to upgrade to your desired Domino version, use the process below.
1. Retrieve the new Domino installer image from quay.io by filling in the desired <version> value in the com-
mand below

docker pull quay.io/domino/fleetcommand-agent:<version>

2. Move your existing domino.yml configuration file to another directory, or rename it.


3. Generate a new domino.yml configuration template by running the initialization command through the new
version of the installer. This will ensure you have a configuration schema conformant to the new version.

docker run --rm -it \


-v $(pwd):/install \
quay.io/domino/fleetcommand-agent:<version> \
init --file /install/domino.yml

4. Copy the values from your old configuration into the new file.
5. When complete, run the install script from the install process, being sure to change the spec.containers.
image value to quay.io/domino/fleetcommand-agent:<version> with the appropriate version.

4.2 Configuration Reference

• schema (required): YAML schema version. Values: 1.0
• name (required): Unique deployment name. This should contain the name of the deployment owner. Values: [a-zA-Z0-9_-]+
• version (required): Domino version to install. Supported versions: 4.1.10, 4.2.0
• hostname (required): Hostname the Domino application will be accessed at. Values: valid FQDN
• pod_cidr: If network policies are enabled, allow access from this CIDR. This range should cover addresses used by your cluster nodes and pods. Values: valid CIDR range, e.g. 10.0.0.0/8
• ssl_enabled (required): Should Domino only be accessible using HTTPS. Values: true, false
• ssl_redirect (required): Should HTTP requests be redirected to HTTPS. Values: true, false
• create_ingress_controller (required): Create an NGINX ingress controller. Values: true, false
• request_resources (required): Create Kubernetes resource requests and limits for services. Values: true, false
• enable_network_policies (required): Use network policies for fine-grained service access. Values: true, false. Note: requires a compatible CNI plugin, e.g. Calico
• enable_pod_security_policies (required): Enables pod security policies for locked down system capabilities. Values: true, false
• create_restricted_pod_security_policy (required): Creates pod security policies for locked down system capabilities. Values: true, false
• kubernetes_distribution: Determines resource compatibility with either OpenShift or CNCF Kubernetes. Values: cncf, openshift


4.2.1 Istio

This section configures how and whether an Istio service mesh is deployed by or integrated with Domino. A Domino-deployed Istio is for Domino use only. These configurations should only be installed and/or enabled if intra-cluster encryption in transit is required.

• istio.enabled (required): Enable Istio in the deployment (i.e. sidecar injection). Values: true, false
• istio.install (required): Install the Istio service with Domino. Values: true, false
• istio.cni (required): Configures whether the Istio installation is done with a CNI. If true, the installation is done with a CNI and requires fewer permissions; this is our preferred and recommended setting. If false, the installation will add required capabilities to every pod security policy: NET_ADMIN, NET_BIND_SERVICE, and NET_RAW. Values: true, false
• istio.namespace: Namespace of the Istio control plane. This field is not meant for a Domino-deployed Istio (i.e. istio.install=true); it is available for integrating with an existing deployed Istio service within the cluster.

4.2.2 Ingress Controller

This section configures the NGINX ingress controller deployed by the fleetcommand-agent.

• ingress_controller.create (required): Whether to create the ingress controller. Values: true, false
• ingress_controller.gke_cluster_uuid (required): When running Domino on GKE, supply the GKE cluster UUID here to configure GCP networking for ingress. Values: cluster UUID


4.2.3 Namespaces

Namespaces are a way to virtually segment Kubernetes executions. Domino will create namespaces according to the
specifications in this section, and the installer requires that these namespaces not already exist at installation time.

• namespaces.platform.name (required): Namespace to place Domino services. Values: Kubernetes name
• namespaces.compute.name (required): Namespace for user executions. Values: Kubernetes name. Note: may be the same as the platform namespace
• namespaces.system.name (required): Namespace for deployment metadata. Values: Kubernetes name
• namespaces.*.annotations: Optional annotations to apply to each namespace. Values: Kubernetes annotations

4.2.4 Storage Classes

Storage Classes are a way to abstract the dynamic provisioning of volumes in Kubernetes.
Domino requires two storage classes:
1. block storage for Domino services and user executions that need fast I/O
2. shared storage that can be shared between multiple executions
Domino supports pre-created storage classes. The installer can also create a shared storage class backed by NFS or a cloud NFS analog, as long as the cluster can access the NFS system for read and write, and it can create several types of block storage classes backed by cloud block storage systems like Amazon EBS.


• storage_classes.block.create (required): Whether to create the block storage class. Values: true, false
• storage_classes.block.name (required): Values: Kubernetes name. Note: always required due to platform limitations; cannot be "", which indicates the default storage class
• storage_classes.block.type (required): Type of block storage class to utilize. Values: ebs, hostpath, gce, azure-disk
• storage_classes.block.base_path: Base path to use on nodes when using hostpath volumes
• storage_classes.block.default (required): Whether to set this storage class as the default. Values: true, false
• storage_classes.shared.create (required): Whether to create the shared storage class. Values: true, false
• storage_classes.shared.name (required): Values: Kubernetes name
• storage_classes.shared.type (required): Type of the shared storage class to utilize. Values: efs, nfs, azure-file. Note that Azure File requires outbound port 445 to be open from your Azure cluster
• storage_classes.shared.efs.region: EFS store AWS region, e.g. us-west-2
• storage_classes.shared.efs.filesystem_id: EFS filesystem ID, e.g. fs-7a535bd1
• storage_classes.shared.nfs.server: NFS server IP or hostname
• storage_classes.shared.nfs.mount_path: Base path to use on the server when creating shared storage volumes
• storage_classes.shared.nfs.mount_options: YAML list of additional NFS mount options, e.g. - mfsymlinks
• storage_classes.shared.azure_file.storage_account: Azure storage account in which to create filestores


4.2.5 Blob Storage

Domino can store long-term, unstructured data in "blob storage" buckets. Currently, only the shared storage class described above (NFS) and S3 are supported.
To apply a default S3 bucket or shared storage type to all use-cases of blob storage, it is only necessary to fill out the default setting and make sure enabled is true. Otherwise, all other blob storage uses (projects, logs, and backups) should be filled out. A configuration sketch follows the key reference below.

• blob_storage.default.enabled (required): Whether the default configuration should take precedence over individual config keys. Values: true, false
• blob_storage.*.type (required): Which type of blob storage to use. Values: shared, s3
• blob_storage.*.s3.region: AWS region of the S3 bucket store, e.g. us-west-2
• blob_storage.*.s3.bucket: S3 bucket name, e.g. domino-bucket-1
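
For example, a hedged sketch of a configuration that points all blob storage use-cases at a single S3 bucket (the bucket name and region are placeholders) might look like:

blob_storage:
  default:
    enabled: true
    type: s3
    s3:
      region: us-west-2
      bucket: domino-bucket-1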

4.2.6 Autoscaler

For Kubernetes clusters without native cluster scaling in response to new user executions, Domino supports the use of
the cluster autoscaler.

• autoscaler.enabled (required): Enable cluster autoscaling. Values: true, false
• autoscaler.cloud_provider: Cloud provider Domino is deployed with. Values: aws, azure
• autoscaler.aws.region: AWS region Domino is deployed into, e.g. us-west-2
• autoscaler.azure.resource_group: Azure resource group Domino is deployed into
• autoscaler.azure.subscription_id: Azure subscription ID Domino is deployed with


AWS Auto-Discovery

The cluster autoscaler supports autodiscovery on AWS. Without any explicit configuration of specific autoscaling
groups, it will detect all ASGs that have the appropriate tags and refresh them if their settings are updated directly.
This means listing all ASGs with accurate min/max settings (or listing them at all) is not required as referenced below
in the Groups section. ASG settings can be updated directly in AWS without having to update the cluster-autoscaler
configuration or rerun the installer.

• autoscaler.auto_discovery.cluster_name: Kubernetes cluster name. Must exactly match the name in AWS
• autoscaler.auto_discovery.tags: Optional. If filled in, cluster_name is ignored. e.g. - my.tag, or []
• autoscaler.auto_discovery.groups: Must be set to [] if using auto_discovery

By default, if no autoscaler.groups and autoscaler.auto_discovery.tags are specified, the cluster_name will be used to
look for the following AWS tags:
• k8s.io/cluster-autoscaler/enabled
• k8s.io/cluster-autoscaler/{{ cluster_name }}
The tags setting can be used to explicitly specify which resource tags the autoscaler service should look for.
If you would like to disable auto-discovery and continue using specific groups, ensure that auto_discovery.
cluster_name is an empty value.
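
A minimal sketch of an auto-discovery configuration (the cluster name and region are placeholders, and the key layout follows the installer example earlier in this guide) could therefore be:

autoscaler:
  enabled: true
  cloud_provider: aws
  auto_discovery:
    cluster_name: my-eks-cluster
    tags: []
  groups: []
  aws:
    region: us-west-2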

Groups

When auto-discovery is not used, autoscaling groups are not dynamically discovered. Each autoscaling group must be individually specified, including the minimum and maximum scaling size.

• autoscaler.groups.*.name: Autoscaling group name. Must exactly match the name in the cloud provider
• autoscaler.groups.*.min_size: e.g. 0
• autoscaler.groups.*.max_size: e.g. 10


4.2.7 External DNS

Domino can automatically configure your cloud DNS provider. More extensive documentation can be found on the
external-dns homepage.

• external_dns.enabled (required): Whether Domino should configure cloud DNS. Values: true, false
• external_dns.provider: Cloud DNS provider, e.g. aws
• external_dns.domain_filters: Only allow access to domains that match this filter, e.g. my-domain.example.com
• external_dns.zone_id_filters: Only allow updates to specific Route53 hosted zones

4.2.8 Email Notifications

Domino supports SMTP for sending email notifications in response to user actions and run results.

• email_notifications.enabled (required): Whether Domino should send email notifications. Values: true, false
• email_notifications.server: SMTP server hostname or IP
• email_notifications.port: SMTP server port
• email_notifications.encryption: Whether the SMTP server uses SSL encryption
• email_notifications.from_address: Email address to send emails from Domino with, e.g. domino@example.com
• email_notifications.authentication.username: If using SMTP authentication, the username
• email_notifications.authentication.password: If using SMTP authentication, the password


4.2.9 Monitoring

Domino supports in-cluster monitoring with Prometheus as well as more detailed, external monitoring through
NewRelic APM and Infrastructure.

• monitoring.prometheus_metrics (required): Install Prometheus monitoring. Values: true, false
• monitoring.newrelic.apm (required): Enable NewRelic APM. Values: true, false
• monitoring.newrelic.infrastructure (required): Enable NewRelic Infrastructure. Values: true, false
• monitoring.newrelic.license_key: NewRelic account license key

4.2.10 Helm

Configuration for the Helm repository that stores Domino’s charts.


helm.version (required): Which version of Helm to use. Values: 2 or 3.
helm.host (required): Hostname of the chart repository. For Helm 2 this should be quay.io or the address of your private appr server. For Helm 3 it should be gcr.io.
helm.namespace: Namespace to find charts in the repository (Helm repo namespace). When using official Domino repositories this should be domino. For Helm 3 with gcr.io or mirrors.domino.tech, use domino-eng-service-artifacts.
helm.prefix: Prefix for the chart repository (application registry prefix). When using official Domino repositories this should be helm-. For Helm 3 with gcr.io or mirrors.domino.tech, this should be an empty string.
helm.username: Username for the chart repository if authentication is required. When using Helm 3 with charts hosted in GCR this must be _json_key.
helm.password: Password for the chart repository if authentication is required. For Helm 3 this is the base64 encoded JSON key that was provided by Domino.
helm.tiller_image (required): URI of the Docker image for the Tiller service to use when running Helm 2. This must point to a version 2.16.1 Tiller image at gcr.io/kubernetes-helm/tiller:v2.16.1 or in your private registry.
helm.cache_path: Path to cached Helm 3 chart files. Set to an empty string ('') to use online chart data.
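
For reference, a sketch of a Helm 3 configuration using Domino's GCR-hosted charts; see the fleetcommand-agent v22 release notes and the Private or offline installation section for complete examples and credential details:

helm:
  version: 3
  host: gcr.io
  namespace: domino-eng-service-artifacts
  prefix: ''
  username: _json_key                                  # required value for GCR authentication
  password: '<base64-encoded JSON key provided by Domino>'
  cache_path: '/app/charts'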


4.2.11 Private Docker Registry

Configuration for the Docker repository that stores Domino’s images.

private_docker_registry.server (required): Docker registry host. Values: quay.io or mirrors.domino.tech.
private_docker_registry.username (required): Docker registry username.
private_docker_registry.password (required): Docker registry password.
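
For example, a sketch matching the EKS installer example later in this chapter; the credentials are placeholders provided by Domino:

private_docker_registry:
  server: quay.io
  username: '$YOUR_DOMINO_PROVIDED_CREDENTIAL'
  password: '$YOUR_DOMINO_PROVIDED_CREDENTIAL'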

4.2.12 Internal Docker Registry

The recommended configuration for the internal Docker registry deployed with Domino. Override values allow the
registry to use S3, GCS, or Azure blob store as a backend store. GCS requires a service account that is already bound
into the Kubernetes cluster, with configuration to ensure the docker-registry service account is properly mapped.

internal_docker_registry.s3_override.region: AWS region of the S3 bucket store. Values: e.g. us-west-2.
internal_docker_registry.s3_override.bucket: S3 bucket name. Values: e.g. domino-bucket-1.
internal_docker_registry.gcs_override.bucket: GCS bucket name. Values: e.g. domino-bucket-1.
internal_docker_registry.gcs_override.service_account_name: GCS service account with access to the bucket.
internal_docker_registry.gcs_override.project_name: GCP project name that Domino is deployed into.
internal_docker_registry.azure_blobs_override.account_name: Azure blob store account name.
internal_docker_registry.azure_blobs_override.account_key: Azure blob store account key.
internal_docker_registry.azure_blobs_override.container: Azure blob store container name.
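
A sketch of an S3-backed override, using the example region and bucket name from the table above; leave the other override sections empty when using S3:

internal_docker_registry:
  s3_override:
    region: us-west-2          # example region
    bucket: domino-bucket-1    # example bucket name
    sse_kms_key_id: ''
    access_key_id: ''
    secret_access_key: ''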


4.2.13 Telemetry

Domino supports user telemetry data to help improve the product.

intercom.enabled (required): Enable Intercom onboarding. Values: true or false.
mixpanel.enabled (required): Enable MixPanel. Values: true or false.
mixpanel.token: MixPanel API token.
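
As shown in the EKS installer example later in this chapter, these keys are nested under a top-level telemetry block:

telemetry:
  intercom:
    enabled: false
  mixpanel:
    enabled: false
    token: ''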

4.2.14 GPU

If using GPU compute nodes, enable the following configuration setting to install the required components.

gpu.enabled (required): Enable GPU support. Values: true or false.

4.2.15 Fleetcommand

Domino supports applying minor patches through an internal tool named Fleetcommand.

fleetcommand.enabled (required): Enable the ability for Domino staff to apply minor patches. Values: true or false.
fleetcommand.api_token: Deployment-specific API token (Domino staff will provide this).
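
A minimal sketch; the API token is deployment-specific and is provided by Domino staff when this feature is used:

fleetcommand:
  enabled: false
  api_token: ''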


4.2.16 Node selectors

Domino will by default deploy some DaemonSets on all available nodes in the hosting cluster. When running in a
multi-tenant Kubernetes cluster, where some nodes are available that should not be used by Domino, you can label
nodes for Domino with a single, consistent label, then provide that label to the fleetcommand-agent with the below
configuration to apply a selector to all Domino resources for that label.

global_node_selectors (optional): List of key/value pairs to use as the label for the selector. See the example below.

Example

global_node_selectors:
domino-owned: "true"

This example would apply a selector for domino-owned=true to all Domino deployment resources.

4.2.17 Ingress controller class

The name of the Domino Ingress class can be changed with this setting. This should generally not need to change.

ingress_controller.class_name (required): Name for the Domino Ingress class. Values: nginx.
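
For example, to keep the default class; per the fleetcommand-agent v25 release notes, any arbitrary string (such as domino) can be used to differentiate Domino from other ingress providers in a shared cluster:

ingress_controller:
  class_name: nginx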


4.2.18 Image caching

These settings control the Domino image caching service, which runs as a privileged pod and uses the host Docker
socket to pre-pull popular Domino environment images onto compute workers. It can be disabled if desired.

image_caching.enabled (required): Whether or not to deploy the image caching service. Values: true or false.
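
For example, to disable the image caching service:

image_caching:
  enabled: false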

4.3 Installer configuration examples

4.3.1 EKS example

schema: '1.0'
name: $YOUR_ORGANIZATION_NAME
version: 4.3.3
hostname: $YOUR_DESIRED_APPLICATION_HOSTNAME
pod_cidr: '$YOUR_POD_CIDR'
ssl_enabled: true
ssl_redirect: true
request_resources: true
enable_network_policies: true
enable_pod_security_policies: true
global_node_selectors: {}
create_restricted_pod_security_policy: true
kubernetes_distribution: cncf
istio:
enabled: false
install: false
cni: true
namespace: istio-system
namespaces:
platform:
name: domino-platform
annotations: {}
labels:
domino-platform: 'true'
compute:
name: domino-compute
annotations: {}
labels:
domino-compute: 'true'
system:
name: domino-system
annotations: {}
labels: {}
ingress_controller:
create: true
gke_cluster_uuid: ''
class_name: nginx
storage_classes:
block:
create: true
name: dominodisk
type: ebs
access_modes:
- ReadWriteOnce
base_path: ''
default: false
parameters: {}
shared:
create: true
name: dominoshared
type: efs
access_modes:
- ReadWriteMany
efs:
region: '$YOUR_AWS_REGION'
filesystem_id: '$YOUR_EFS_ID'
nfs:
server: ''
mount_path: ''
mount_options: []
azure_file:
storage_account: ''
blob_storage:
projects:
type: s3
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
access_key_id: ''
secret_access_key: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
logs:
type: s3
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
access_key_id: ''
secret_access_key: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
backups:
type: s3
s3:
region: ''
bucket: ''
sse_kms_key_id: ''
access_key_id: ''
secret_access_key: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
default:
type: s3
s3:
region: '$YOUR_AWS_REGION'
bucket: '$YOUR_BUCKET_NAME'
sse_kms_key_id: ''
access_key_id: ''
secret_access_key: ''
azure:
account_name: ''
account_key: ''
container: ''
gcs:
bucket: ''
service_account_name: ''
project_name: ''
enabled: false
autoscaler:
enabled: true
cloud_provider: aws
auto_discovery:
cluster_name: $YOUR_EKS_CLUSTER_NAME
tags: []
groups:
- name: ''
min_size: 0
max_size: 0
aws:
region: ''
azure:
resource_group: ''
subscription_id: ''
spotinst_controller:
enabled: false
token: ''
account: ''
external_dns:
enabled: false
provider: aws
domain_filters: []
zone_id_filters: []
git:
storage_class: dominodisk
email_notifications:
enabled: false
server: smtp.customer.org
port: 465
encryption: ssl
from_address: domino@customer.org
authentication:
username: ''
password: ''
monitoring:
prometheus_metrics: true
newrelic:
apm: false
infrastructure: false
license_key: ''
helm:
version: 3
host: ''
namespace: ''
insecure: false
username: ''
password: ''
skip_daemonset_validation: false
daemonset_timeout: null
tiller_image: ''
prefix: ''
cache_path: '/app/charts'
private_docker_registry:
server: quay.io
username: '$YOUR_DOMINO_PROVIDED_CREDENTIAL'
password: '$YOUR_DOMINO_PROVIDED_CREDENTIAL'
internal_docker_registry:
s3_override:
region: ''
bucket: ''
sse_kms_key_id: ''
access_key_id: ''
secret_access_key: ''
gcs_override:
bucket: ''
service_account_name: ''
project_name: ''
azure_blobs_override:
account_name: ''
account_key: ''
container: ''
telemetry:
intercom:
enabled: false
mixpanel:
enabled: false
token: ''
gpu:
enabled: true
fleetcommand:
enabled: false
api_token: ''
teleport:
acm_arn: arn:aws:acm:<region>:<account>:certificate/<id>
enabled: false
hostname: teleport-domino.example.org
remote_access: false
image_caching:
enabled: true

4.4 Private or offline installation

Domino provides bundles of offline installation media for use when running the fleetcommand-agent without
Internet access to upstream sources of images and charts. To serve these resources, you must have a Docker registry
accessible to your cluster.

4.4.1 Downloading

You can find URLs of available offline installation bundles in the fleetcommand-agent release notes. These
bundles can be downloaded via cURL with basic authentication. Contact your Domino account team for credentials.
Note that there is one file required: a versioned collection of images.
Example download:

curl -u username:password -#SfLOJ https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v34-docker-images-4.4.0.tar

4.4.2 Extracting and loading

The images bundle is a .tar archive that must be extracted before being used.

tar -xvf fleetcommand-agent-docker-images-v34-4.4.0.tar

In the fleetcommand-agent-docker-images bundle there will be:


• a collection of individual Docker image .tar files
• an images.json metadata file
• a domino-load-images.py script
domino-load-images.py is a script to ingest the images.json metadata file and load the associated Docker
images for a specific Domino version into the given remote Docker registry.
To load images into your private registry, run domino-load-images.py and pass in the URL of your registry as
an argument. The script expects to run in the same directory as the images.json metadata file and the .tar image
files.
Example:
python domino-load-images.py your-registry-url.domain:port

Once images have been loaded into your private registry you’re ready to install Domino.

4.4.3 Installing

To install Domino using a custom registry, the image references must be modified to point to your private registry
instead of the upstream sources. Use the --image-registry argument on the init command to rewrite all image
references to your registry.
docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 \
init --image-registry your-registry-url.domain:port --full --file /install/domino.yml

If your registry requires authentication, ensure the private_docker_registry section of your installer config-
uration is filled in with the correct credentials:
private_docker_registry:
server: your-registry-url.domain:port
username: '<username>'
password: '<password>'

Helm 3

Charts come pre-packaged within the fleetcommand-agent image. Set up the helm object in configuration to
match the following:
helm:
version: 3
host: gcr.io
namespace: domino-eng-service-artifacts
prefix: ''
username: ''
(continues on next page)

password: ''
tiller_image: gcr.io/kubernetes-helm/tiller:v2.16.1 # Version is required and MUST be 2.16.1

insecure: false
cache_path: '/app/charts'

Note that the http protocol before the hostname in this configuration is important. Once these changes have been
made to your installer configuration file, you can run the fleetcommand-agent to install Domino.

4.4.4 Configuration

When performing offline installations there are 3 main central configuration keys that need to be repointed to the
private registry hosting the referenced images. From the Domino landing page, click Admin in the main menu. Then
in the administration portal, click Advanced > Central Config. Use the Add Record button at top right to add the
following records:

com.cerebro.domino.builder.image: IMAGE_URI of the latest domino/builder-job
com.cerebro.domino.computegrid.kubernetes.executor.imageName: IMAGE_URI of the latest domino/executor
com.cerebro.domino.modelmanager.harnessProxy.image: IMAGE_URI of the latest domino/harness-proxy

4.5 fleetcommand-agent release notes

4.5.1 fleetcommand-agent v34 (February 2021)

Image: quay.io/domino/fleetcommand-agent:v34
Installation bundles:
• 4.4.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v34-docker-images-4.4.0.tar
Changes
• Adds support for Domino 4.4.0

4.5.2 fleetcommand-agent v33 (February 2021)

Image: quay.io/domino/fleetcommand-agent:v33
Installation bundles:


• 4.4.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v33-docker-images-4.4.0.tar
Changes
• Adds support for Domino 4.4.0
• New configuration options have been added for the new Teleport Kubernetes agent:
If a deployment currently has teleport.enabled and teleport.remote_access set to true, they
should be disabled and teleport_kube_agent.enabled should be set instead.

teleport_kube_agent:
enabled: false
proxyAddr: teleport.domino.tech:443
authToken: eeceeV4sohh8eew0Oa1aexoTahm3Eiha

• Domino 4.4.0 includes support for restartable workspace disaster recovery in AWS leveraging EBS snapshots.
To support this functionality, existing installations may potentially require additional IAM permissions for plat-
form node pool instances.
The permissions required, without any resource restriction (i.e. *), are the following:
– ec2:CreateSnapshot
– ec2:CreateTags
– ec2:DeleteSnapshot
– ec2:DeleteTags
– ec2:DescribeAvailabilityZones
– ec2:DescribeSnapshots
– ec2:DescribeTags
Known Issues
• If upgrading from Helm 2 to Helm 3, please read the release notes from v22 for caveats and known issues.

4.5.3 fleetcommand-agent v32 (December 2020)

Image: quay.io/domino/fleetcommand-agent:v32
Installation bundles:
• 4.3.3 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v32-docker-images-4.3.3.tar
Changes:
• Fixes a memory leak in the EFS CSI driver.

4.5.4 fleetcommand-agent v31 (December 2020)

Image: quay.io/domino/fleetcommand-agent:v31
Installation bundles:


• 4.3.3 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v31-docker-images-4.3.3.tar
Changes:
• Updates to latest build of 4.3.3.

4.5.5 fleetcommand-agent v30 (December 2020)

Image: quay.io/domino/fleetcommand-agent:v30
Installation bundles:
• 4.3.3 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v30-docker-images-4.3.3.tar
Changes:
• Adds support for Domino 4.3.3
• The agent now supports installing Istio 1.7 (set istio.install to true), and installing Domino in Istio-
compatible mode (set istio.enabled to true).

istio:
enabled: false
install: false
cni: true
namespace: istio-system

• The EFS storage provider for new installs has changed from efs-provisioner to the EFS CSI driver, in
order to support encryption in transit to EFS. For existing installs, this does not require any changes unless
encryption in transit is desired. If a migration to encrypted EFS is necessary, please contact Domino support.
One limitation of the new driver, compared to the previous, is an inability to dynamically create directories
according to provisioned volumes. Support for pre-provisioned directories in AWS is done through access
points, which must be created before Domino can be installed.
To specify the access point at install time, ensure the filesystem_id is set in the format {EFS ID}::{AP ID}:

storage_classes:
shared:
efs:
filesystem_id: 'fs-285b532d::fsap-00cb72ba8ca35a121'

• Two new fields were added in order to simplify DaemonSet management during upgrades for particularly large
clusters. DaemonSets do not have configuration options for upgrades and pods will be replaced one-by-one. For
large compute node pools, this can take a significant amount of time.

helm:
skip_daemonset_validation: false
daemonset_timeout: 300

Setting helm.skip_daemonset_validation to true will bypass post-upgrade validation that all pods
have been successfully recreated. helm.daemonset_timeout is an integer representing the number of
seconds to wait for all daemon pods in a DaemonSet to be recreated.


• 4.3.3 introduces limited availability of the new containerized Domino image builder: Forge. Forge can be
enabled with the ImageBuilderV2 feature flag, although Domino services must be restarted to cause this
flag to take effect. Running Domino image builds in a cluster that uses a non-Docker container runtime, such as
cri-o or containerd, requires that the feature flag be enabled.
To support the default rootless mode that Forge is configured to use, the worker nodes must support unprivileged
mounts, user namespaces, and overlayfs (either natively or through FUSE). Currently, GKE and EKS do not
support user namespace remapping and require the following extra configuration to properly use Forge.

services:
forge:
chart_values:
config:
fullPrivilege: true

4.5.6 fleetcommand-agent v29 (November 2020)

Image: quay.io/domino/fleetcommand-agent:v29
Installation bundles:
• 4.3.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v29-docker-images-4.3.2.tar
Changes:
• Updated Keycloak migration job version.

4.5.7 fleetcommand-agent v28 (November 2020)

Image: quay.io/domino/fleetcommand-agent:v28
Installation bundles:
• 4.3.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v28-docker-images-4.3.2.tar
Changes:
• Adds support for Domino 4.3.2
• Adds support for encrypted EFS access by using the EFS CSI driver.
• A new istio field has been added to the domino.yml schema for testing and development of future releases.
Domino 4.3.2 does not support Istio and therefore you must set enabled in this new section to false.

istio:
enabled: false
install: false
cni: true
namespace: istio-system


• New fields to specify static AWS access key and secret key credentials have been added. These are currently
unused and can be left unset.

blob_storage:
projects:
s3:
access_key_id: ''
secret_access_key: ''

• A new field for Teleport remote access integration has been added. This is currently unused and should be set
to false.

teleport:
remote_access: false

4.5.8 fleetcommand-agent v27 (October 2020)

Image: quay.io/domino/fleetcommand-agent:v27
Installation bundles:
• 4.3.1 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v27-docker-images-4.3.1.tar
Changes:
• Fix a bug where dry-run installation could cause internal credentials to be improperly rotated.

4.5.9 fleetcommand-agent v26 (October 2020)

Image: quay.io/domino/fleetcommand-agent:v26
Installation bundles:
• 4.3.1 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v26-docker-images-4.3.1.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.3.1
• Adds support for running Domino on OpenShift 4.4+.
• A new field has been added to the installer configuration that controls whether or not the image caching service
is deployed.

image_caching:
enabled: true


• A new field has been added to the installer configuration that specifies the Kubernetes distribution for resource
compatibility. The available options are cncf (Cloud Native Computing Foundation) and openshift.

kubernetes_distribution: cncf

4.5.10 fleetcommand-agent v25 (August 2020)

Image: quay.io/domino/fleetcommand-agent:v25
Installation bundles:
• 4.3.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v25-docker-images-4.3.0.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.3.0.
• A new cache_path field has been added to the helm configuration section. Leaving this field blank will
ensure charts are fetched from an upstream repository.

helm:
cache_path: ''

• To facilitate deployment of Domino into clusters with other tenants, a new global node selector field has been
added to the top-level configuration that allows an arbitrary label to be used for scheduling all Domino work-
loads. Its primary purpose is to limit workloads such as DaemonSets that would be scheduled on all available
nodes in the cluster to only nodes with the provided label. Note that this can override default node pool selectors
such as dominodatalab.com/node-pool: "platform", but does not replace them.

global_node_selectors:
domino-owned: "true"

• To facilitate deployment of Domino into clusters with other tenants, a configurable Ingress class has been added
to allow differentiation from other ingress providers in a cluster. If multiple Ingress objects are created with
the default class, it’s possible for other tenant’s paths to interfere with Domino and vice versa. Generally, this
setting does not need to change, but can be set to any arbitrary string value (such as domino).

ingress_controller:
class_name: nginx

4.5.11 fleetcommand-agent v24 (July 2020)

Image: quay.io/domino/fleetcommand-agent:v24
Installation bundles:


• 4.2.4 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-docker-images-v24-4.2.4.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.2.3 and 4.2.4.

4.5.12 fleetcommand-agent v23 (May 2020)

Image: quay.io/domino/fleetcommand-agent:v23
Installation bundles:
• 4.2.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-docker-images-v23-4.2.2.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.2.2.
• The known issue with v22 around Domino Apps being stopped after upgrade has been resolved. Apps will now
automatically restart after upgrade.
• The known issue with Elasticsearch not upgrading until manually restarted has been resolved. Elasticsearch will
automatically cycle through a rolling upgrade when the deployment is upgraded.
• Fixed an issue that prevented the fleetcommand-agent
• Adds support for autodiscovery of scaling resources by the cluster autoscaler.
Two new fields have been added under the autoscaler.auto_discovery key:

autoscaler:
auto_discovery:
cluster_name: domino
tags: [] # optional. if filled in, cluster_name is ignored.

By default, if no autoscaler.groups or autoscaler.auto_discovery.tags are specified, the cluster_name will be used to look for the following AWS tags:
– k8s.io/cluster-autoscaler/enabled
– k8s.io/cluster-autoscaler/{{ cluster_name }}
The tags parameter can be used to explicitly specify which resource tags the autoscaler service should look
for. Auto scaling groups with matching tags will have their scaling properties detected and the autoscaler will
be configured to scale them.
Note that the IAM role for the platform nodes where the autoscaler is running still needs an autoscaling access
policy that will allow it to read and scale the groups.
When upgrading from an install that uses specific groups, ensure that auto_discovery.cluster_name
is an empty value.


Known Issues:
• If you’re upgrading from fleetcommand-agent v21 or older, be sure to read the v22 release notes and implement
the Helm configuration changes.
• An incompatibility between how nginx-ingress was initially installed and should be maintained going
forward means that action is required for both Helm 2 and Helm 3 upgrades.
For Helm 2 upgrades, add the following services object to your domino.yml to ensure compatibility:

services:
nginx_ingress:
chart_values:
controller:
metrics:
service:
clusterIP: "-"
service:
clusterIP: "-"

For Helm 3, there are two options. If nginx-ingress has not been configured to provide a cloud-native load
balancer that is tied to the hosting DNS entry, then nginx-ingress can be safely uninstalled prior to the
upgrade. If, however, the load balancer address must be maintained across the upgrade, then the initial upgrade
after the Helm 3 migration will fail. Before retrying the upgrade, execute the following commands.

export NAME=nginx-ingress
export SECRET=$(kubectl get secret -l owner=helm,status=deployed,name=$NAME -n domino-platform | awk '{print $1}' | grep -v NAME)

kubectl get secret -n domino-platform $SECRET -oyaml | sed "s/release:.*/release: $(kubectl get secret -n domino-platform $SECRET -ogo-template="{{ .data.release | base64decode | base64decode }}" | gzip -d - | sed 's/clusterIP: \\"\\"//g' | gzip | base64 -w0 | base64 -w0)/" | kubectl replace -f -

kubectl get secret -n domino-platform $SECRET -oyaml | sed "s/release:.*/release: $(kubectl get secret -n domino-platform $SECRET -ogo-template="{{ .data.release | base64decode | base64decode }}" | gzip -d - | sed 's/rbac.authorization.k8s.io\/v1beta1/rbac.authorization.k8s.io\/v1/g' | gzip | base64 -w0 | base64 -w0)/" | kubectl replace -f -

4.5.13 fleetcommand-agent v22 (May 2020)

Image: quay.io/domino/fleetcommand-agent:v22
Installation bundles:
• 4.2.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-docker-images-v22-4.2.0.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.2.
• Adds support for Helm 3


The helm object in the installer configuration has been restructured to accommodate Helm 3 support. There is
now a helm.version property which can be set to 2 or 3. When using Helm 2, the configuration should be
similar to the below example. The username and password will continue to be standard Quay.io credentials
provided by Domino.

helm:
version: 2
host: quay.io
namespace: domino
prefix: helm- # Prefix for the chart repository, defaults to `helm-`
username: "<username>"
password: "<password>"
tiller_image: gcr.io/kubernetes-helm/tiller:v2.16.1 # Version is required and MUST be 2.16.1

insecure: false

When using Helm 3, configure the object as shown below. Helm 3 is a major release of the underlying tool
that powers installation of Domino’s services. Helm 3 removes the Tiller service, which was the server-side
component of Helm 2. This improves the security posture of Domino installation by reducing the scope and
complexity of required RBAC permissions, and it enables namespace isolation of services. Additionally, Helm
3 adds support for storing charts in OCI registries.
Currently, only gcr.io and mirrors.domino.tech are supported as chart repositories. If you are switching to Helm
3, you will need to contact Domino for gcr.io credentials. When using Helm 3, the helm configuration object
should be similar to the below example.

helm:
version: 3
host: gcr.io
namespace: domino-eng-service-artifacts
insecure: false
username: _json_key # To support GCR authentication, this must be "_json_key"
password: "<password>"
tiller_image: null # Not required for Helm 3
prefix: '' # Charts are stored without a prefix by default

Migration of an existing Helm 2 installation to Helm 3 is done seamlessly within the installer. Once successful,
Tiller will be removed from the cluster and all Helm 2 configuration is deleted.
Known Issues:
• Elasticsearch is currently configured to only upgrade when the pods are deleted. To properly upgrade an existing
deployment from Elasticsearch 6.5 to 6.8, after running the installer use the rolling upgrade process. This
involves first deleting the elasticsearch-data pods, then the elasticsearch-master pods. See
the example procedure below.

kubectl delete pods --namespace domino-platform elasticsearch-data-0


˓→elasticsearch-data-1 --force=true --grace-period=0

# Wait for elasticsearch-data-0 & elasticsearch-data-1 to come back online


kubectl delete pods --namespace domino-platform elasticsearch-master-0
˓→elasticsearch-master-1 elasticsearch-master-2

• An incompatibility between how nginx-ingress was initially installed and should be maintained going
forward means that action is required for both Helm 2 and Helm 3 upgrades.
For Helm 2 upgrades, add the following services object to your domino.yml to ensure compatibility:


services:
nginx_ingress:
chart_values:
controller:
metrics:
service:
clusterIP: "-"
service:
clusterIP: "-"

For Helm 3, there are two options. If nginx-ingress has not been configured to provide a cloud-native load
balancer that is tied to the hosting DNS entry, then nginx-ingress can be safely uninstalled prior to the
upgrade. If, however, the load balancer address must be maintained across the upgrade, then the initial upgrade
after the Helm 3 migration will fail. Before retrying the upgrade, execute the following commands.

export NAME=nginx-ingress
export SECRET=$(kubectl get secret -l owner=helm,status=deployed,name=$NAME -n domino-platform | awk '{print $1}' | grep -v NAME)

kubectl get secret -n domino-platform $SECRET -oyaml | sed "s/release:.*/release: $(kubectl get secret -n domino-platform $SECRET -ogo-template="{{ .data.release | base64decode | base64decode }}" | gzip -d - | sed 's/clusterIP: \\"\\"//g' | gzip | base64 -w0 | base64 -w0)/" | kubectl replace -f -

kubectl get secret -n domino-platform $SECRET -oyaml | sed "s/release:.*/release: $(kubectl get secret -n domino-platform $SECRET -ogo-template="{{ .data.release | base64decode | base64decode }}" | gzip -d - | sed 's/rbac.authorization.k8s.io\/v1beta1/rbac.authorization.k8s.io\/v1/g' | gzip | base64 -w0 | base64 -w0)/" | kubectl replace -f -

• Domino Apps do not currently support a live upgrade from version 4.1 to version 4.2. After the upgrade, all
Apps will be stopped.
To restart them, you can use the /v4/modelProducts/restartAll endpoint like in the below example,
providing an API key for a system administrator.

curl -X POST --include --header "X-Domino-Api-Key: <admin-api-key>" 'https://<domino-url>/v4/modelProducts/restartAll'

4.5.14 fleetcommand-agent v21 (May 2020)

Image: quay.io/domino/fleetcommand-agent:v21
Changes:
• Adds support for Domino 4.1.10
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique
and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.


4.5.15 fleetcommand-agent v20 (March 2020)

Image: quay.io/domino/fleetcommand-agent:v20
Changes:
• Support for 4.1.9 has been updated to reflect a new set of artifacts.
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique
and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.

4.5.16 fleetcommand-agent v19 (March 2020)

Image: quay.io/domino/fleetcommand-agent:v19
Changes:
• Added catalogs for Domino up to 4.1.9
• Added support for Docker NO_PROXY configuration. Domino containers will now respect the configuration
and connect to the specified hosts without proxy.
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique


and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.

4.5.17 fleetcommand-agent v18 (March 2020)

Image: quay.io/domino/fleetcommand-agent:v18
Changes:
The following new fields have been added to the fleetcommand-agent installer configuration.
1. Storage class access modes
The storage_class options have a new field called access_modes that allows configuration of the un-
derlying storage class’ allowed access modes.

storage_classes:
block:
[snip]
access_modes:
- ReadWriteOnce

2. Git service storage class


Previously, the deployed Domino Git service used storage backed by the shared storage class. Now, the
dominodisk block storage class will be used by default. If using custom storage classes, set this to the name
of the block storage class. For existing Domino installations, you must set this to dominoshared.
For new installs:

git:
storage_class: dominodisk

For existing installations and upgrades:

git:
storage_class: dominoshared
services:
git_server:
chart_values:
persistence:
size: 5Ti


4.5.18 fleetcommand-agent v17 (March 2020)

Image: quay.io/domino/fleetcommand-agent:v17
Changes:
• Added catalogs for Domino up to 4.1.8

4.5.19 fleetcommand-agent v16 (February 2020)

Image: quay.io/domino/fleetcommand-agent:v16
Changes:
• Added catalogs for Domino up to 4.1.7
• Calico CNI is now installed by default for EKS deployments
• AWS Metadata API is blocked by default for Domino version >= 4.1.5
• Added Private registry support in the Installer
• New Install configuration attributes (see the reference documentation for more details):
– sse_kms_key_id option for Blob storage
– gcs option for Google Cloud Storage
– Namespaces now support optional labels to apply labels during installation
– teleport for Domino managed installations only

4.5.20 fleetcommand-agent v15 (January 2020)

Image: quay.io/domino/fleetcommand-agent:v15
Changes:
• Added catalog for Domino 4.1.4
• Ensure fleetcommand-agent also deletes system namespace.
• Updated version of Cluster Autoscaler to 1.13.9

4.5.21 fleetcommand-agent v14 (January 2020)

Image: quay.io/domino/fleetcommand-agent:v14
Changes:


• Updated version of Cluster Autoscaler to 1.13.7


• Added catalog for Domino 4.1.3



CHAPTER 5

Configuration

5.1 Central Configuration

The Central Configuration is where all global settings for a Domino installation are enumerated. You can access the
Central Configuration interface from the Admin portal by clicking Advanced > Central Config.
The interface is organized into a list of records. You can click on an existing record to edit its attributes, or you can
add a record with the Add Record button at top right. If there is no record explicitly set for an option, the default
value will be used. In order for changes made in the Central Config to take effect, you must restart Domino services
using the link at the top of the interface.


5.1.1 Project visibility options

These options are related to project visibility settings and are available in namespace common and should be recorded
with no name.

com.cerebro.domino.publicProjects.enabled (default: true): If set to false, users cannot set projects to public visibility.
com.cerebro.domino.defaultProjectVisibility (default: Public): Controls the default visibility setting for new projects. Options are Public or Private.

5.1.2 Email notifications

These options are related to email notifications from Domino and are available in namespace common and should be
recorded with no name.

smtp.from (default: None): The 'from' address for email notifications sent by Domino.
smtp.host (default: None): Hostname of the SMTP relay to use for sending emails from Domino.
smtp.user (default: None): Username to use for authenticating to the SMTP host.
smtp.password (default: None): Password for the SMTP user.
smtp.port (default: 25): Port to use for connecting to the SMTP host.
smtp.ssl (default: false): Whether the SMTP host uses SSL.

5.1.3 Model APIs

These options are related to Model APIs and are available in namespace common and should be recorded with no
name.

com.cerebro.domino.modelmanager.instances.defaultNumber (default: 2): Default number of instances per Model used for Model API scaling.
com.cerebro.domino.modelmanager.instances.maximumNumber (default: 32): Maximum number of instances per Model used for Model API scaling.
com.cerebro.domino.modelManager.nodeSelectorLabelKey (default: dominodatalab.com/node-pool): Key used in the Kubernetes label node selector for Model API pods.
com.cerebro.domino.modelManager.nodeSelectorLabelValue (default: default): Value used in the Kubernetes label node selector for Model API pods.


5.1.4 Environments

These options are related to Domino Environments and are available in namespace common and should be recorded
with no name.

com.cerebro.domino.environments.canNonSysAdminsCreateEnvironments (default: true): If set to false, only system administrators will be able to edit environments.
com.cerebro.domino.environments.default.image (default: quay.io/domino/base:Ubuntu18_DAD_Py3.6_R3.6_20190918): Docker image URI for the initial default environment.
com.cerebro.domino.environments.default.name (default: Domino Analytics Distribution Py3.6 R3.6): Name of the initial default environment.

5.1.5 Authentication

These options are related to the Keycloak authentication service and are available in namespace common and should
be recorded with no name.

authentication.oidc.externalOrgsEnabled (default: false): When true, Domino will manage Organization membership via users' group SAML attributes.
authentication.oidc.externalRolesEnabled (default: false): When true, Domino will manage Admin role assignments via users' role SAML attributes.


5.1.6 Long-running workspaces

These options are related to long-running workspace sessions and are available in namespace common and should be
recorded with no name.

com.cerebro.domino.workloadNotifications.longRunningWorkloadDefinitionInSeconds (default: 259200): Defines how long a workspace must run in seconds before the workspace is classified as 'long-running' and begins to generate notifications or becomes subject to automatic shutdown.
com.cerebro.domino.workloadNotifications.isEnabled (default: false): Set to true to enable the option for email notifications to users when their workspaces become long-running. Users can turn these notifications on or off for themselves in their account settings.
com.cerebro.domino.workloadNotifications.isRequired (default: false): Set to true to turn on long-running workspace notifications for all users. While this is true, users cannot turn off long-running workspace notifications.
com.cerebro.domino.workloadNotifications.maximumPeriodInSeconds (default: 7200): Maximum time in seconds users may set as the period between receiving long-running notification emails. Users will receive repeated notifications about long-running workspaces with this frequency.
com.cerebro.domino.workspaceAutoShutdown.isEnabled (default: false): Set to true to enable automatic shutdown of long-running workspaces. Users can turn automatic shutdown for their workspaces on or off from their account settings.
com.cerebro.domino.workspaceAutoShutdown.isRequired (default: false): Set to true to turn on automatic shutdown of long-running workspaces for all users. While this is true, users cannot turn off automatic shutdown of their long-running workspaces.
com.cerebro.domino.workspaceAutoShutdown.globalMaximumLifetimeInSeconds (default: 259200): Longest time in seconds a long-running workspace will be allowed to continue before automatic shutdown. Users cannot set their automatic shutdown timer to be longer than this.


5.1.7 Datasets scratch spaces

These options are related to datasets scratch spaces and are available in namespace common and should be recorded
with no name.

com.cerebro.domino.dataset.scratch.riskThresholdOneInDays (default: 5.5): Sets the first datasets scratch space risk threshold in days. Scratch spaces with changes that have not been recorded as a snapshot for this duration are marked as medium risk.
com.cerebro.domino.dataset.scratch.riskThresholdTwoInDays (default: 10): Sets the second datasets scratch space risk threshold in days. Scratch spaces with changes that have not been recorded as a snapshot for this duration are marked as high risk.

5.1.8 Compute grid

These options are related to the compute grid and are available in namespace common and should be recorded with
no name.


com.cerebro.domino.computegrid.kubernetes.volume.gcFrequency (default: 10min): Controls how often the garbage collector runs to delete old or excess persistent volumes.
com.cerebro.domino.computegrid.kubernetes.volume.maxAge (default: None): Setting a value in minutes here will cause persistent volumes older than that to be automatically deleted by the garbage collector.
com.cerebro.domino.computegrid.kubernetes.volume.maxIdle (default: 32): Maximum number of idle persistent volumes to keep. Idle volumes in excess of this number will be deleted by the garbage collector.
com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged (default: 64): Maximum number of salvaged volumes to keep. Salvaged volumes in excess of this number will be deleted by the garbage collector.
com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge (default: 7d): Setting a value in days here will cause salvaged volumes older than that to be automatically deleted by the garbage collector.
com.cerebro.domino.computegrid.kubernetes.volume.storageClass (default: dominodisk): Kubernetes storage class that will be used to dynamically provision persistent volumes. This is set initially to the value of storage_classes.block.name in the installer storage classes configuration.
com.cerebro.domino.computegrid.kubernetes.volume.volumesSizeInGB (default: 15): Size in GB of compute grid persistent volumes. This is the total amount of disk space available to users in runs and workspaces.
com.cerebro. ... (default: 25): This is the maximum number of executions each user will be allowed to run concurrently. If a user attempts to start additional executions in excess of this, those ...

5.1.9 On-demand Spark

These options are related to the on-demand Spark clusters and are available in namespace common and should be
recorded with no name.

com.cerebro.domino.integrations.spark.checkClusterStatusIntervalSeconds (default: 1): Frequency in seconds to run status checks on on-demand Spark clusters.
com.cerebro.domino.integrations.spark.onDemand.workerStorageMountPath (default: /tmp): File system path on which Spark worker storage is mounted.
com.cerebro.domino.integrations.spark.sparkConfDirDefault (default: None): Option to supply an alternative default configuration directory for on-demand Spark clusters.
com.cerebro.domino.workbench.onDemandSpark.worker.memoryOverheadMinMiB (default: 384): Minimum amount of memory in MiB to use for Spark worker overhead.
com.cerebro.domino.workbench.onDemandSpark.worker.memoryOverheadFactor (default: 0.1): Spark worker overhead scaling factor.


5.1.10 File download API

These options are related to the file contents download API endpoint and are available in namespace common and
should be recorded with no name.

com.cerebro.domino.restrictBlobApi (default: false): Set to true to require an admin API key to download files via API. When false, any user with the blob ID for a file may download it via API.

5.1.11 Builder

These options are related to the Domino builder.


The Domino builder is a container that runs as a Kubernetes job to build the Docker images for Domino environments
and Domino model APIs. This container is deployed to a node labeled with a configurable Kubernetes label (defaults
to domino/build-node=true) whenever a user triggers an environment or model build.

com.cerebro.domino.builder.nodeSelectorLabelKey (default: domino/build-node): Node label key that the selector in the pod specification for the builder job will target.
com.cerebro.domino.builder.nodeSelectorLabelValue (default: true): Node label value that the selector in the pod specification for the builder job will target.
com.cerebro.domino.builder.docker.socketPath (default: /var/run/docker.sock): The builder job mounts the host Docker socket to execute builds. This should point to a path on the builder nodes where a Docker socket file can be mounted as part of the builder job pod specification.


5.1.12 Workspaces

These options are related to Domino workspaces.


com.cerebro.domino.workbench.project.defaultVolumeSizeGiB (default: 5): Controls the default allocated persistent volume size for a new workspace.
com.cerebro.domino.workbench.project.minVolumeSizeGiB (default: 4): Controls the minimum allocated persistent volume size for a new workspace.
com.cerebro.domino.workbench.project.maxVolumeSizeGiB (default: 200): Controls the maximum allocated persistent volume size for a new workspace.
com.cerebro.domino.workbench.workspace.maxWorkspacesPerUserPerProject (default: 2): Sets a limit on the number of provisioned workspaces per user per project.
com.cerebro.domino.workbench.workspace.maxWorkspacesPerUser (default: 8): Sets a limit on the number of provisioned workspaces per user across all projects.
com.cerebro.domino.workbench.workspace.maxWorkspaces (default: 1500): Sets a limit on the number of provisioned workspaces across the whole Domino deployment.
com.cerebro.domino.workbench.workspace.maxAllocatedVolumeSizeAcrossAllWorkspacesGiB (default: None): Sets a limit on the total volume size of all provisioned workspaces across the whole Domino deployment combined.
com.cerebro.domino.workbench.workspace.stopToDeleteDelayDuration (default: 20.seconds): The number of seconds the frontend waits after the workspace stops before making the delete request to the backend. This allows enough time after workspace stop for the workspace's persistent volume to be released. If users frequently receive an error after trying a delete, then this value should be increased.
com.cerebro.domino.workbench.workspace.volume.enableSnapshots (default: false): Whether or not to capture snapshots of workspace persistent volumes.
com.cerebro.domino.workbench.workspace.volume.snapshotCleanupFrequency (default: 1.day): How often to delete all but the X most recent snapshots, where X is a number defined by workbench.workspace.volume.numSnapshotsToRetain.
com.cerebro.domino.workbench.workspace.volume.numSnapshotsToRetain (default: 5): The number of snapshots to retain. All older snapshots beyond this limit will be deleted during a periodic cleanup.

5.1.13 Authorization

These options are related to authorization and user roles.

com.cerebro.domino.frontend.authentication.defaultRoles (default: Practitioner): A comma-separated set of roles that will be assigned to a newly created user if no other roles are specified.
com.cerebro.domino.restrictPublishing (default: false): If true, only SupportStaff and SysAdmins can create launchers and schedule runs.
com.cerebro.domino.authorization.restrictManageCollaborators (default: false): If true, only Project Owners can manage project collaborators.
com.cerebro.domino.authorization.restrictProjectSharing (default: false): If true, only SupportStaff and SysAdmins can manage project collaborators and visibility or transfer project ownership.

5.2 Change the default project for new users

• Overview
• Setting custom new-user default projects

5.2.1 Overview

By default, every new user in Domino is the owner of a quick-start project. This project is created when the user
signs up, and it contains many useful sample files that show how to take advantage of Domino features, plus a detailed
README.


Admins can replace the default quick-start with one or more customized new-user default projects.

5.2.2 Setting custom new-user default projects

First, create the projects that you want all new users to own a copy of upon signup. These projects should have names,
descriptions, and READMEs that make it clear to new users what they’ll find in the project. These projects can be
owned by any user, however they should be Private projects.
Record the username and project name paths for these projects. For example:
admin-user/getting-started-project
admin-user/sample-app-project

Note that the name of the project will be reproduced for new users. If you set the example projects above as default
projects, all new users will own copies at:
new-username/getting-started-project
new-username/sample-app-project

Once your projects are ready for use by new users, set the following central configuration option.

Namespace: common
Key: com.cerebro.domino.frontend.overrideDefaultProject
Value: string of comma separated project paths

For the examples shown above, the value of this setting would be:
admin-user/getting-started-project, admin-user/sample-app-project


5.3 Project stage configuration

As a data science leader, you have the ability to define a set of custom project stages that users in Domino can use to
label their projects for creating useful views in the Projects Portfolio. These stages can be used to mark a project’s
progress through the workflow and life cycle your team uses. To learn more about how users interact with and set
project stages, read about stage and status in the projects overview.
To set up the stages that will be available to users in your Domino platform, open the Admin interface, then click
Advanced > Project Stage Configuration.

On the project stage configuration interface, you can click Add Record to create a new stage label that will be available
for Domino users to set on their projects. The record at the top of the list is the default stage all new projects created
in Domino will have, and projects can be changed to any other available stage.


These stages are a custom set of labels that allow your Domino users to communicate progress in a project to their
colleagues and to leadership. It’s up to you as a data science leader to determine the stages that you want available,
and to communicate to your team how they should be used.
Domino recommends setting up a custom default project for new users with information in the README about your
team's practices, available environments, and how users should use project stages.

5.4 Domino integration with Atlassian Jira

Domino can integrate with Atlassian Jira to enable users to interact with Jira from inside a Domino project.
This document describes how to link Domino to Jira. Once this configuration is done, users with a Domino account
and a Jira account can link them via OAuth.

5.4.1 Requirements

Domino supports both Jira Cloud and Jira Server version 7.1.6+. For Jira integration to work, an application link needs
to be configured between Domino and Jira.
This process requires system administrator access to Domino and also a Jira account with admin permissions.

5.4.2 Configuration

Step 1:
In the Domino admin UI, under Advanced click Feature Flags. Set ShortLived.JiraIntegrationEnabled
to True.
Step 2:
In the Domino admin UI, under Advanced click Jira Configuration. Provide the URL of your Jira service then click
Add configuration.


You will need the details on this page in subsequent steps. Please note the Public Key, Incoming Consumer
Key and Incoming Consumer Name as these won’t be visible once you move away from this screen.
This step adds the relevant central config values and requires a restart of the Domino services. Click the restart services
link.
Step 3:
Log in to your Atlassian/Jira account. Note that you need admin privileges on this account to proceed further.
1. Click the gear icon to open settings, then click Products


2. Click Application Link under the Integrations section


3. Provide your Domino URL then click Create New Link
4. A popup box with your URL pre-filled should appear. Ignore the warning that you see and click continue.


5. Provide the Application Name that was generated in Step 2 above


6. You can leave the rest of the fields empty, or fill them with dummy values if they are required for your deployment
7. You must select the checkbox at the end of the form before you click Continue. See the screenshot below for
reference



8. Click Continue, and in the next form provide the Consumer Name, Consumer Key, and Public Key from
Step 2

5.4.3 Dashboard

All projects which have a Jira ticket linked to them will be visible in the Jira Configuration page (Admin -> Advanced
-> Jira Configuration). An admin can unlink any of these projects directly from this screen.

5.4.4 Reconfigure / Remove Jira Integration

Domino can be reconfigured to use another Jira instance, or the configuration can be removed, with the following steps:
1. Unlink all Jira-linked projects
2. Go to the Jira Configuration page and delete the current configuration
3. Follow the steps in the Configuration section to link a new connection



CHAPTER 6

Compute

Compute nodes available to run user workloads in Domino are conceptually organized into node pools based on
their Kubernetes labels. The set of pools available to users is referred to as the Domino Compute Grid, and it is the
responsibility of Domino administrators to manage and configure these pools.

6.1 Managing the Compute Grid

6.1.1 Overview

Users in Domino assign their Runs to Domino Hardware Tiers. A hardware tier defines the type of machine a job will
run on, and the resource requests and limits for the pod that the Run will execute in. When configuring a hardware
tier, you will specify the machine type by providing a Kubernetes node label.
You should create a Kubernetes node label for each type of node you want available for compute workloads in Domino,
and apply it consistently to compute nodes that meet that specification. Nodes with the same label become a node pool,
and they will be used as available for Runs assigned to a Hardware Tier that points to their label.
Which pool a Hardware Tier is configured to use is determined by the value in the Node Pool field of the Hardware
Tier editor. In the screenshot below, the large-k8s Hardware Tier is configured to use the default node pool.


The diagram below shows a cluster configured with two node pools for Domino, one named default and one named
default-gpu. You can make additional node pools available to Domino by labeling them with the same scheme:
dominodatalab.com/node-pool=<node-pool-name>. The arrows in this diagram represent Domino requesting
that a node with a given label be assigned to a Run. Kubernetes will then assign the Run to a node in the specified
pool that has sufficient resources.


By default, Domino creates a node pool with the label dominodatalab.com/node-pool=default and all
compute nodes Domino creates in cloud environments are assumed to be in this pool. Note that in cloud environments
with automatic node scaling, you will configure scaling components like AWS Auto Scaling Groups or Azure Scale
Sets with these labels to create elastic node pools.
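To see which node pools currently exist in your cluster and which nodes belong to each, you can list nodes with the
pool label shown as a column. This is a standard kubectl invocation rather than a Domino-specific command:

$ kubectl get nodes -L dominodatalab.com/node-pool

Nodes without a value in the NODE-POOL column are not part of any Domino node pool and will not be selected
for Domino executions.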

6.1.2 Kubernetes pods

Every Run in Domino is hosted in a Kubernetes pod on a type of node specified by the selected Hardware Tier.
The pod hosting a Domino Run contains three containers:
1. The main Run container where user code is executed
2. An NGINX container for handling web UI requests
3. An executor support container which manages various aspects of the lifecycle of a Domino execution, like
transferring files or syncing changes back to the Domino file system


6.1.3 Resourcing requests

The amount of compute power required for your Domino cluster will fluctuate over time as users start and stop Runs.
Domino relies on Kubernetes to find space for each execution on existing compute resources. In cloud autoscaling
environments, if there’s not enough CPU or memory to satisfy a given execution request, the Kubernetes cluster
autoscaler will start new compute nodes to fulfill that increased demand. In environments with static nodes, or in
cloud environments where you have reached the autoscaling limit, the execution request will be queued until resources
are available.
Autoscaling Kubernetes clusters will shut nodes down when they are idle for more than a configurable duration. This
reduces your costs by ensuring that nodes are used efficiently, and terminated when not needed.
Cloud autoscaling resources have properties like the minimum and maximum number of nodes they can create. You
should set the node maximum to whatever you are comfortable with given the size of your team and expected volume
of workloads. All else equal, it is better to have a higher limit than a lower one, as nodes are cheap to start up and
shut down, while your data scientists’ time is very valuable. If the cluster cannot scale up any further, your users’
executions will wait in a queue until the cluster can service their request.
The amount of resources Domino will request for a Run is determined by the selected Hardware Tier for the Run.
Each Hardware Tier has five configurable properties that determine the resource requests and limits for Run pods.
• Cores
The number of requested CPUs.
• Cores limit
The maximum number of CPUs. Recommended to be the same as the request.
• Memory
The amount of requested memory.
• Memory limit
The maximum amount of memory. Recommended to be the same as the request.
• Number of GPUs
The number of GPU cards available.
The request values, Cores and Memory, as well as Number of GPUs, are thresholds used to determine whether a node
has capacity to host the pod. These requested resources are effectively reserved for the pod. The limit values control
the amount of resources a pod can use above and beyond the amount requested. If there’s additional headroom on the
node, the pod can use resources up to this limit.
However, if resources are in contention, and a pod is using resources beyond those it requested, and thereby causing
excess demand on a node, the offending pod may be evicted from the node by Kubernetes and the associated Domino
Run is terminated. For this reason, Domino strongly recommends setting the requests and limits to the same values.
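If you want to confirm the exact requests and limits that a Hardware Tier produced for a running execution, you can
inspect the pod directly. The following is a minimal sketch, assuming the compute namespace is domino-compute
and using a hypothetical run pod name:

$ kubectl get pod run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'

The output lists each container in the pod with its resource requests and limits, so you can verify that the main run
container matches the Hardware Tier definition.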


6.1.4 User Executions Quota

To prevent a single user from monopolizing a Domino deployment, an administrator can set a limit on the number
of executions that a user can have running concurrently. Once a user reaches this limit, any additional executions will
be queued. This includes executions for Domino workspaces, jobs, and web applications, as well as any executions
that make up an on-demand distributed compute cluster. For example, in the case of an on-demand Spark cluster, an
execution slot is consumed for each Spark executor and for the master.
See Important settings for details.

6.1.5 Common questions

How do I view the current nodes in my compute grid?

From the top menu bar in the admin UI, click Infrastructure. You will see both Platform and Compute nodes in this
interface. Click the name of a node to get a complete description, including all applied labels, available resources, and
currently hosted pods. This is the full kubectl describe for the node. Non-Platform nodes in this interface with
a value in the Node Pool column are compute nodes that can be used for Domino Runs by configuring a Hardware
Tier to use the pool.
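The same information is also available from the command line if you have kubectl access. For example, assuming
the default node pool name:

$ kubectl get nodes -l dominodatalab.com/node-pool=default
$ kubectl describe node <node-name>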


How do I view details on currently active executions?

From the top menu of the admin UI, click Executions. This interface lists active Domino execution pods and shows
the type of workload, the Hardware Tier used, the originating user and project, and the status for each pod. There
are also links to view a full kubectl describe output for the pod and the node, and an option to download the
deployment lifecycle log for the pod generated by Kubernetes and the Domino application.


How do on-demand Spark clusters show up in the active executions interface?

Each Spark node, including master and worker nodes, launched as part of an on-demand Spark cluster will be displayed
as a separate row in the executions interface, with complete information available on the originating project and user,
as well as the hardware tier.

How do I create or edit a Hardware Tier?

From the top menu of the admin UI, click Advanced > Hardware Tiers, then on the Hardware Tiers page click New
to create a new Hardware Tier or Edit to modify an existing Hardware Tier.


Keep in mind that your Hardware Tier’s CPU, memory, and GPU requests should not exceed the available resources
of the machines in the target node pool after accounting for overhead. If you need more resources than are available on
existing nodes, you may need to add a new node pool with different specifications. This may mean adding individual
nodes to a static cluster, or configuring new auto-scaling components that provision new nodes with the required
specifications and labels.

How can I use more shared memory in my execution container?

You can allow hardware tiers to exceed the default limit of 64MB for shared memory. This is especially beneficial for
applications that can make use of shared memory.
From the top menu of the admin UI, click Advanced > Hardware Tiers, then on the Hardware Tiers page click New
to create a new Hardware Tier or Edit to modify an existing one. Check the Allow executions to exceed the default
shared memory limit checkbox.
Checking this option will override the /dev/shm (shared memory) limit, and any shared memory consumption will
count toward the overall memory limit of the hardware tier. Be sure to consider and incorporate the size of /dev/shm
in any memory usage calculations for a hardware tier with this option enabled.

Warning: /dev/shm is considered part of the overall memory footprint of an execution container. It is possible
to exceed the total memory of the container when overriding /dev/shm to use more shared memory. Exceeding
the container's memory limit via /dev/shm will terminate the container.

6.1.6 Important settings

The following settings in the common namespace of the Domino central configuration affect compute grid behavior.

Deploying state timeout

• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.deployingStateTimeoutSeconds


• Value: Number of seconds an execution pod in a deploying state will wait before timing out. Default is 60 * 60
(1 hour).

Preparing state timeout

• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.preparingStateTimeoutSeconds
• Value: Number of seconds an execution pod in a preparing state will wait before timing out. Default is 60 * 60
(1 hour).

Maximum executions per user

• Key: com.cerebro.domino.computegrid.userExecutionsQuota.maximumExecutionsPerUser
• Value: Maximum number of executions each user may have running concurrently. If a user tries to run more
than this, the excess executions will queue until existing executions finish. Default is 25.

Quota state timeout

• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.userExecutionsOverQuotaStateTimeoutSeconds
• Value: Number of seconds an execution pod that cannot be assigned due to user quota limitations will wait for
resources to become available before timing out. Default is 24 * 60 * 60 (24 hours).

6.2 Hardware Tier best practices

• Overview
• Accounting for overhead
– Kubernetes management overhead
– Domino daemon-set overhead
– Domino execution overhead
– When should I account for overhead?
– Example
• Isolating workloads and users using node pools
• Set resource requests and limits to the same values


6.2.1 Overview

Domino Hardware Tiers define Kubernetes requests and limits and link them to specific node pools. We recommend
the following best practices.
1. Accounting for overhead
2. Isolating workloads and users using node pools
3. Setting resource requests and limits to the same values

6.2.2 Accounting for overhead

When designing hardware tiers, you need to take into account what resources will be available on a given node when
Domino submits your workload for execution. Not all physical memory and CPU cores of your node will be available
due to system overhead.
You should consider the following overhead components:
1. Kubernetes management overhead
2. Domino daemon-set overhead
3. Domino execution sidecar overhead

Kubernetes management overhead

Kubernetes typically reserves a portion of each node's capacity for daemons and pods that are required for Kubernetes
itself. The amount of reserved resources usually scales with the size of the node, and also depends on the Kubernetes
provider or distribution.
Click the links below to view information on reserved resources for cloud-provider managed Kubernetes offerings:
• AWS EKS
• Azure AKS
• Google GKE
The best way to understand the available resources for your instance is to check one of your compute nodes with the
kubectl describe nodes command and then look for the Allocatable section of the output. It will show
the memory and CPU available for Domino.
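For example, to print just the Allocatable section for one node, or to summarize allocatable CPU and memory across
all nodes:

$ kubectl describe node <node-name> | grep -A 8 "Allocatable"
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory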


Domino daemon-set overhead

Domino runs a set of management pods that reside on each of the compute nodes. These are used for things like log
aggregation, monitoring, and environment image caching.
The overhead of these daemon-sets is roughly 0.5 CPU cores and 0.5 Gi RAM. This overhead is taken from the
allocatable resources on the node.

Domino execution overhead

Lastly, for each Domino execution, there is a set of supporting containers in the execution pod that manage
authentication, handle request routing, load files, and install dependencies. These supporting containers make CPU
and memory requests that Kubernetes takes into account when scheduling execution pods.
The supporting container overhead currently is roughly 1 CPU core and 1.5 GiB RAM. This is configurable and may
vary for your specific deployment.

When should I account for overhead?

Overhead is relevant if you want to define a hardware tier dedicated to one execution at a time per node, such as for a
node with a single physical GPU. It is also relevant if you absolutely need to maximize node density.

Example

Consider an m5.2xlarge EC2 node with a raw capacity of 8 CPU cores and 32 GiB of RAM.
When used as part of an EKS cluster, the node reports an allocatable capacity of roughly 27 GiB of RAM and 7910m
of CPU (about 7.9 cores):

Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           104845292Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32120476Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7910m
  ephemeral-storage:           95551679124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      28372636Ki
  pods:                        58

On top of that, conservatively account for 500m CPU and 0.5 GiB of RAM for the Domino and EKS daemons.
Lastly, for a single execution add 1000m CPU and 1.5 GiB RAM for sidecars, and you are left with roughly 6410m
CPU and 25 GiB RAM that you can use for a single large hardware tier.
If you want to partition the node into smaller hardware tiers, you will need to account for the sidecar overhead for
every execution that you want to colocate.
As a general rule, larger nodes allow for more flexibility, as Kubernetes will take care of efficiently packing your
executions onto the available capacity.
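The arithmetic from this example can be written out explicitly. The figures below are approximate and will vary by
deployment; treat them as a sketch of the method rather than exact numbers:

Allocatable (m5.2xlarge on EKS):        7910m CPU    ~27.0 GiB RAM
- Daemon-set overhead (Domino + EKS):    500m CPU      0.5 GiB RAM
- Sidecar overhead (1 execution):       1000m CPU      1.5 GiB RAM
= Available for one large tier:         6410m CPU     ~25.0 GiB RAM

Colocating two executions instead doubles the sidecar overhead:
7410m - 2 x 1000m = 5410m CPU and 26.5 - 2 x 1.5 = 23.5 GiB, or roughly 2705m CPU and 11.75 GiB per execution.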


You can see which pods are running on a specific node by visiting the Infrastructure admin page and clicking on
the name of the node. In the image below, there is a box around the execution pods. The other pods handle logging,
caching, and other services.

6.2.3 Isolating workloads and users using node pools

Node pools are defined by labels added to nodes in a specific format:
dominodatalab.com/node-pool=<your-node-pool>. In the hardware tier form, you just need to include
your-node-pool. You can name a node pool anything you like, but we recommend naming them something
meaningful given the intended use.
Domino typically comes pre-configured with default and default-gpu node pools, with the assumption that
most user executions will run on nodes in one of those pools. As your compute needs become more sophisticated, you
may want to keep certain users separate from one another or provide specialized hardware to certain groups of users.
So if there’s a data science team in New York City that needs a specific GPU machine that other teams don’t need, you
could use the following label for the appropriate nodes: dominodatalab.com/node-pool=nyc-ds-gpu. In
the hardware tier form, you would specify nyc-ds-gpu. To ensure only that team has access to those machines,
create a NYC organization, add the correct users to the organization, and give that organization access to the new
hardware tier that uses the nyc-ds-gpu node pool label.
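A minimal sketch of labeling the nodes for such a pool, with illustrative node names:

$ kubectl label node ip-10-0-2-41.ec2.internal dominodatalab.com/node-pool=nyc-ds-gpu
$ kubectl label node ip-10-0-2-87.ec2.internal dominodatalab.com/node-pool=nyc-ds-gpu

In cloud environments with autoscaling, apply the same label through the scaling group configuration instead, so that
newly provisioned nodes join the pool automatically.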

6.2.4 Set resource requests and limits to the same values

With Kubernetes, resource limits must be >= resource requests. So if your memory request is 16 GiB, your limit
must be >= 16 GiB. While setting a limit higher than the request can be useful - there are cases where allowing bursts
of CPU or memory is helpful - it is also dangerous. Kubernetes may evict a pod that is using more resources than it
initially requested. For Domino workspaces or jobs, this would cause the execution to be terminated.


It is for this reason that we recommend setting memory and CPU requests equal to limits. In this case, Python and R
cannot allocate more memory than the limit, and execution pods will not be evicted.
On the other hand, if the limit is higher than the request, it is possible for a user to use resources that another user's
execution pod should be able to access. This is the "noisy neighbor" problem that you may have experienced in other
multi-user environments. But instead of allowing the noisy neighbor to degrade performance for other pods on the
node, Kubernetes will evict the offending pod when necessary to free up resources.
User data on disk will not be lost, because Domino stores user data on a persistent volume that can be reused. But
anything in memory will be lost and the execution will have to be restarted.

6.3 Model resource quotas

6.3.1 Overview

The pods that host Model APIs have hardware specifications based on resource quotas set by Domino system
administrators. A resource quota determines the CPU and memory resources available to the Model that uses it.


6.3.2 Creating and editing resource quotas

From the admin home, click Advanced -> Resource Quotas to open the management interface.

From here you can create, edit, and set default resource quotas. Resource quotas cannot be permanently deleted. To
make a resource quota unavailable for use, edit it and set Visible to false.
Resource quotas have the following properties:
• CPUs requested The number of cores that will be reserved for a Model with this quota.
• Memory requested The amount of RAM that will be reserved for a model with this quota.
• CPU limit If the hosting node has idle cores available, a model running this quota can make use of additional
cores up to this limit.
• Memory limit If the hosting node has RAM available, a model running this quota can make use of additional
memory up to this limit.
• Visible This property on a resource quota must be set to true for the quota to appear in the dropdown selector
for users publishing Models.
• Default The resource quota with this set to true is the quota that will be used for all newly published Models
by default.

6.4 Persistent volume management


• Overview
• Definitions
• Storage workflow for Jobs
• Storage workflow for Workspaces
• Resumable Workspace volume backups on AWS
• Garbage collection
• Salvaged volumes
• FAQ

6.4.1 Overview

When not in use, Domino project files are stored and versioned in the Domino blob store. When a Domino run is
started from a project, the project's files are copied to a Kubernetes persistent volume that is attached to the compute
node and mounted in the run.

6.4.2 Definitions

• Persistent Volume (PV)
A storage volume in a Kubernetes cluster that can be mounted to pods. Domino dynamically creates persistent
volumes to provide local storage for active runs.
• Persistent Volume Claim (PVC)
A request made in Kubernetes by a pod for storage. Domino uses these to correctly match a new run with either
a new PV or an idle PV that has the project’s files cached.
• Idle Persistent Volume
A PV that was used by a previous run, and which is currently not being used. Idle PV’s will either be re-used
for a new run or garbage collected.
• Storage Class
Kubernetes method of defining the type, size, provisioning interface, and other properties of storage volumes.


6.4.3 Storage workflow for Jobs

When a user starts a new job, Domino will broker assignment of a new execution pod to the cluster. This pod will
have an associated PVC which defines for Kubernetes what type of storage it requires. If an idle PV exists matching
the PVC, Kubernetes will mount that PV on the node it assigns to host the pod, and the job or workspace will start. If
an appropriate idle PV does not exist, Kubernetes will create a new PV according to the Storage Class.
When the user completes their workspace or job, the PV data will be written to the Domino File System, and the PV
will be unmounted and sit idle until it is either reused for the user’s next job or garbage collected. By reusing PV’s,
users who are actively working in a project will not need to copy data from the blob store to a PV repeatedly.
A job will only match with either a fresh PV or one previously used by that project. PV’s are not reused between
projects.

6.4.4 Storage workflow for Workspaces

Workspace volumes are handled differently than volumes for Jobs. Workspaces are potentially long lived development
environments that users will stop and resume repeatedly without writing data back to the Domino File System each
time. As a result, the PV for the workspace is a similarly long-lived resource that stores the user’s working data.
These workspace PVs are durably associated with the resumable workspace they are initially created for. Each time
that workspace is stopped, the PV is detached and preserved so that it’s available the next time the user starts the
workspace. When the workspace starts again, it reattaches its PV and the user will see all of their working data saved
during the last session.
Only when a user chooses to initiate a sync will the contents of their project files in the workspace PV be written
back to the Domino File System. A resumable workspace PV will only be deleted if the user deletes the associated
workspace.

6.4.5 Resumable Workspace volume backups on AWS

Since the data in resumable workspace volumes is not automatically written back to the Domino File System, there is
a risk of lost work should the volume be lost or deleted. When Domino is running on AWS, it safeguards against this
by backing up the EBS volume that backs the workspace PV with EBS snapshotting to S3. If you have accidentally
deleted or lost a resumable workspace volume that contains data you want to recover, contact Domino support for
assistance in restoring from the snapshot.


6.4.6 Garbage collection

Domino has configurable values to help you tune your cluster to balance performance with cost controls. The more
idle volumes you allow the more likely it is that users will be able to reuse a volume and avoid needing to copy project
files from the blob store. However, this comes at the cost of keeping additional idle PVs.
By default, Domino will:
• Limit the total number of idle PV’s to 32. This can be adjusted by setting the following option in the central
config:
common com.cerebro.domino.computegrid.kubernetes.volume.maxIdle

• Terminate any idle PV that has not been used in a certain number of days. This can be adjusted by setting the
following option in the central config:
common com.cerebro.domino.computegrid.kubernetes.volume.maxAge

This value is expressed in terms of days. The default value is empty, which means unlimited. A value
of 7d will terminate any idle PV after seven days.

6.4.7 Salvaged volumes

When a user's job fails unexpectedly, Domino will preserve the volume so data can be recovered. After
a workspace or job ends, claimed PVs are placed into one of the following states, indicated with the
dominodatalab.com/volume-state label.
• available
If the run ends normally, the underlying PV will be available for future runs.
• salvaged
If the run fails, the underlying PV will not be eligible for reuse, and is held in this state to be salvaged.
Salvaged PV’s will not be reused automatically by the future workspaces or jobs, but can be manually mounted to a
workspace in order to recover work.
By default, Domino will:
• Limit the total number of salvaged PV’s to 64. This can be adjusted by setting the following option in the central
config:
common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged

• Terminate any salvaged PV that has not been used in a certain number of days. This can be adjusted by setting
the following option in the central config:


common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge

The value is expressed in terms of days. The default value is seven days. A value of 14d will
terminate any salvaged PV after fourteen days.
To recover a salvaged volume:
1. Find the PV that was attached to your job or workspace, which will be in the Deployment logs for your job or
workspace.
2. Create a pod attached to the salvaged volume.
3. Recover the files with your most convenient method (scp, AWS CLI, kubectl cp, etc.)
This script will do Step 2 and will provide the appropriate commands in its output. Remember to delete the PVC and
PV, otherwise these resources will continue to be used.
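As an illustration of step 2, the following sketch creates a throwaway pod that mounts a salvaged claim so its files can
be copied out with kubectl cp. The PVC name, pod name, and image are placeholders, not values produced by
Domino:

$ kubectl apply -n domino-compute -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: salvage-recovery
spec:
  containers:
  - name: shell
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: salvaged
      mountPath: /salvaged
  volumes:
  - name: salvaged
    persistentVolumeClaim:
      claimName: <salvaged-pvc-name>
EOF

$ kubectl cp domino-compute/salvage-recovery:/salvaged ./recovered-files
$ kubectl delete pod salvage-recovery -n domino-compute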

6.4.8 FAQ

How do I see the current PV’s in my cluster?


Run the following command to see all current PV’s sorted by last-used:

kubectl get pv --sort-by='.metadata.annotations.dominodatalab.com/last-used'

How do I change the size of the storage volume for my jobs or workspaces?
You can set the volume size for new PV’s by editing the following central config value:

Namespace: common
Key: com.cerebro.domino.computegrid.kubernetes.volume.volumesSizeInGB
Value: Volume size in GB (default 15)

6.5 Adding a node pool to your Domino cluster

• Overview
• Creating a scalable node pool in EKS


6.5.1 Overview

Making a new node group available to Domino is as simple as adding new Kubernetes worker nodes with a distinct
dominodatalab.com/node-pool label. You can then reference the value of that label when creating new
hardware tiers to configure Domino to assign executions to those nodes.
See below for an example of creating a scalable node pool in EKS.

6.5.2 Creating a scalable node pool in EKS

This example shows how to create a new node group with eksctl and expose it to the cluster autoscaler as a labeled
Domino node pool.
1. Create a new-nodegroup.yaml file like the one below, and configure it with the properties you want the
new group to have. All values shown with a $ are variables that you should modify.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $CLUSTER_REGION
nodeGroups:
  - name: $GROUP_NAME # this can be any name you choose, it will be part of the ASG and template name
    instanceType: $AWS_INSTANCE_TYPE
    minSize: $MINIMUM_GROUP_SIZE
    maxSize: $DESIRED_MAXIMUM_GROUP_SIZE
    volumeSize: 400 # important to allow for image caching on Domino workers
    availabilityZones: ["$YOUR_CHOICE"] # this should be the same AZ (or the same multiple AZ's) as your other node pools
    ami: $AMI_ID
    labels:
      "dominodatalab.com/node-pool": "$NODE_POOL_NAME" # this is the name you'll reference from Domino
      # "nvidia.com/gpu": "true" # uncomment this line if this pool uses a GPU instance type
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "$NODE_POOL_NAME"
      # "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu": "true" # uncomment this line if this pool uses a GPU instance type

Note that the AWS tag with key k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool
is important for exposing the group to your cluster autoscaler.
Note also that you cannot have compute node pools in separate, isolated AZ's, as this creates volume affinity
errors.


2. Once your configuration file describes the group you want to create, run eksctl create nodegroup
--config-file=new-nodegroup.yaml.
3. Take the name of the resulting ASG and add it to the autoscaling.groups section of your domino.yml
installer configuration.
4. Run the Domino installer to update the autoscaler.
5. Create a new hardware tier in Domino that references the new node pool label.
When finished, you can start Domino executions that use the new Hardware Tier and those executions will be assigned
to nodes in the new group, which will be scaled as configured by the cluster autoscaler.
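Before creating the hardware tier, you can verify that the new nodes have registered with the expected pool label:

$ kubectl get nodes -l dominodatalab.com/node-pool=$NODE_POOL_NAME -o wide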

6.6 Removing a node from service

• Overview
• Temporarily removing a node from service
• Permanently removing a node from service
– Identifying user workloads
– Dealing with long-running workloads
– Dealing with older versions of Kubernetes
• Sample commands for iterating over many nodes and/or pods

6.6.1 Overview

There may be times when you need to remove a specific node (or multiple nodes) from service, either temporarily or
permanently. This may include cases of troubleshooting nodes that are in a bad state, or retiring nodes after an update
to the AMI so that all nodes are using the new AMI.
This page describes how to temporarily prevent new workloads from being assigned to a node, as well as how to safely
remove workloads from a node so that it can be permanently retired.

6.6.2 Temporarily removing a node from service

The kubectl cordon <node> command will prevent any additional pods from being scheduled onto the node,
without disrupting any of the pods currently running on it. For example, let’s say a new node in your cluster has come


up with some problems, and you want to cordon it before launching any new runs to ensure they will not land on that
node. The procedure might look like this:
$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready    <none>   51m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5

$ kubectl cordon ip-192-168-24-46.us-east-2.compute.internal
node/ip-192-168-24-46.us-east-2.compute.internal cordoned

$ kubectl get nodes
NAME                                          STATUS                     ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready,SchedulingDisabled   <none>   53m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5

Notice the SchedulingDisabled status on the cordoned node.


You can undo this and return the node to service with the command kubectl uncordon <node>.

6.6.3 Permanently removing a node from service

Identifying user workloads

Before removing a node from service permanently, you should ensure there are no workloads still running on it
that should not be disrupted. For example, you might see the following workloads running on a node (notice the
specification of the compute namespace with -n and wide output to include the node hosting the pod with -o):
$ kubectl get po -n domino-compute -o wide | grep ip-192-168-24-46.us-east-2.compute.internal
run-5e66acf26437fe0008ca1a88-f95mk                2/2   Running   0   23m   192.168.4.206    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66ad066437fe0008ca1a8f-629p9                3/3   Running   0   24m   192.168.28.87    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7     3/3   Running   0   51m   192.168.23.128   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9   2/2   Running   0   54m   192.168.28.1     ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
domino-build-5e67c9299c330f0008f70ad1             1/1   Running   0   3s    192.168.13.131   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>

Different types of workloads should be treated differently. You can see the details of a particular workload with
kubectl describe po run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute. The labels
section of the describe output is particularly useful for distinguishing the type of workload, as each of the workloads
named run-... will have a label like dominodatalab.com/workload-type=<type of workload>.
The example above contains one each of the major user workloads:
• run-5e66acf26437fe0008ca1a88-f95mk is a batch Job, with label dominodatalab.com/
workload-type=Batch. It will stop running on its own once it is finished and disappear from the
list of active workloads.
• run-5e66ad066437fe0008ca1a8f-629p9 is a Workspace, with label dominodatalab.com/
workload-type=Workspace. It will keep running until the user who launched it shuts it down.
You have the option of contacting users to shut down their workspaces, waiting a day or two in the expectation
they will shut them down naturally, or removing the node with the workspaces still running. (The last option
is not recommended unless you are certain there is no un-synced work in any of the workspaces and have
communicated with the users about the interruption.)
• run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7 is an App, with label dominodatalab.
com/workload-type=App. It is a long-running process, and is governed by a Kubernetes deployment.
It will be recreated automatically if you destroy the node hosting it, but will experience whatever downtime is
required for a new pod to be created and scheduled on another node. See below for methods to proactively move
the pod and reduce downtime.
• model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9 is a Model API. It does not have a
dominodatalab.com/workload-type label, and instead is easily identifiable by the pod name. It is
also a long-running process, similar to an App, with similar concerns. See below for methods to proactively
move the pod and reduce downtime.
• domino-build-5e67c9299c330f0008f70ad1 is an environment image build. It will finish on its own and go into a
Completed state.

Dealing with long-running workloads

For the long-running workloads governed by a Kubernetes deployment, you can proactively move the pods off of the
cordoned node by running a command like this:

$ kubectl rollout restart deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute

Notice the name of the deployment is the same as the first part of the name of the pod in the above section. You can see
a list of all deployments in the compute namespace by running kubectl get deploy -n domino-compute.
Whether the associated app or model experiences any downtime will depend on the update strategy of the deployment.
For the two example workloads above in a test deployment, one App and one Model API, we have the following


(describe output filtered here for brevity):


$ kubectl describe deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute | grep -i "strategy\|replicas:"
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  1 max unavailable, 1 max surge

$ kubectl describe deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute | grep -i "strategy\|replicas:"
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  0 max unavailable, 25% max surge

The App in this case would experience some downtime, since the old pod will be terminated immediately (1 max
unavailable with only 1 pod currently running). The model will not experience any downtime since the termina-
tion of the old pod will be forced to wait until a new pod is available (0 max unavailable). If desired, you can
edit the deployments to change these settings and avoid downtime.
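For example, the following is a sketch of patching the App's deployment so that the old pod is kept until a replacement
is ready; the deployment name is taken from the example above and the strategy values are illustrative:

$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'

Keep in mind that Domino manages these deployments, so a manual change like this may be reverted when the app is
republished.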

Dealing with older versions of Kubernetes

Earlier versions of Kubernetes do not have the kubectl rollout restart command, but a similar effect can
be achieved by "patching" the deployment with a throwaway annotation like this:

$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'

The patching process will respect the same update strategies as the restart command above.

6.6.4 Sample commands for iterating over many nodes and/or pods

In cases where you need to retire many nodes, it can be useful to loop over many nodes and/or workload pods in a
single command. Customizing the output format of kubectl commands, appropriate filtering, and combining with
xargs makes this possible.
For example, to cordon all nodes in the default node pool, you can run the following:

$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers | xargs kubectl cordon

To view only Apps running on a particular node, you can filter using the labels discussed above:

$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>

To do a rolling restart of all model pods (over all nodes), you can run:

$ kubectl get deploy -n domino-compute -o custom-columns=:.metadata.name --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy


When constructing such commands for larger maintenance, always run the first part of the command by itself to verify
that the list of names being passed to xargs and on to the final kubectl command is what you expect!



CHAPTER 7

Keycloak authentication service

Domino uses Keycloak, an enterprise-grade open source authentication service, to manage users and logins. Keycloak
runs in a pod in the Domino Platform. There are three modes you can use for identity management in Domino:
1. Local usernames and passwords
2. Identity federation to LDAP / AD
3. Identity brokering to a SAML provider for SSO

7.1 Accessing the Keycloak UI

You can access the Keycloak UI on any Domino instance at:
https://<domino-domain>/auth/
Note that the trailing / in the URL is required.
To log in as the default keycloak administrator user, you will need kubectl access to the cluster to retrieve the
password from a Kubernetes secret called keycloak-http.
Run kubectl -n <domino-platform-namespace> get secret keycloak-http -o yaml to
fetch the contents of the secret. The output should look like the following:


apiVersion: v1
data:
  password: <encrypted-password>
kind: Secret
metadata:
  creationTimestamp: 2019-09-09T21:23:15Z
  labels:
    app.kubernetes.io/instance: keycloak
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: keycloak
    helm.sh/chart: keycloak-4.14.1-0.10.2
  name: keycloak-http
  namespace: domino
  resourceVersion: "6746"
  selfLink: /api/v1/namespaces/domino/secrets/keycloak-http
  uid: 09009f96-d348-11e9-9ea1-0aa417381fd6
type: Opaque

Decode the password by running echo '<encrypted-password>' | base64 --decode. With this password
you will be able to log in to the Keycloak UI as the keycloak administrator user in the master realm. Read the
official Keycloak documentation on the master realm to learn more.
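The retrieval and decoding can also be combined into a single command:

$ kubectl -n <domino-platform-namespace> get secret keycloak-http -o jsonpath='{.data.password}' | base64 --decode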
Keycloak will be configured automatically by Domino with a realm named DominoRealm that will be used for
Domino authentication. When reviewing or changing settings for Domino authentication, ensure that you have
DominoRealm selected in the upper left.

7.2 Local username and password configuration

The simplest option for authentication to Domino is to use local usernames and passwords. In this case all user
information is stored by Keycloak in the Postgres database, and there is no federation or brokering to other identity
providers.


7.2.1 Configuration

In this mode the key settings are on the Login tab of the DominoRealm settings page.

The one setting on this tab that is not supported is Email as username, because Domino does not currently support
an email address as a valid username. Note also that if you want to use the Verify Email option, an SMTP connection
must be configured in the Email tab.

7.2.2 User management

You can add, edit, and deactivate local users from the Users menu. Click View all users to load user data.


7.3 LDAP / AD federation

Keycloak provides the ability to connect to an LDAP / AD identity provider and cache user information.

7.3.1 Adding a provider

This can be configured in the User Federation menu. Select ldap from the Add provider... dropdown menu.
For details on all available options, read the official Keycloak documentation on User storage federation.
When adding a provider according to those docs, if you are migrating from an older Domino, you can make use of
your existing ldap.conf file on the Domino frontend to see exactly what inputs you should use for the provider
settings. Some of the key pieces of information are:

ldap.conf name       Keycloak user federation setting name
Search principal     Bind DN
Search base          Users DN
Search filter        Additional Filtering

Group and Role synchronization can be configured with steps similar to those listed for SSO, except that user attributes
must first be imported to Keycloak via an LDAP mapper. Once that is done, and the users in Keycloak have the
appropriate user attributes specifying group membership or role, the remaining setup (to map from Keycloak to
Domino) will follow the steps in the SSO group and role synchronization related to Client Mappers.
NOTE: updates to a user’s group or role will not fully synchronize to Domino until the user has a login event to
Domino.

7.3.2 Configuring mappers

In addition to configuring the LDAP connection, you may also need to review the LDAP mappers associated with the
LDAP connection you have configured. Some mappers will be configured by default based on the LDAP vendor that
was chosen, but you may need to modify these based on the specific configuration of your provider. You will need to
make sure that there are mappers for the following attributes:
• username
• firstName
• lastName
• email
For more details, read the official Keycloak documentation on LDAP mappers.


7.4 Single Sign-On configuration

7.4.1 Configure Single Sign-On

Domino can integrate with a SAML 2.0 or OIDC identity provider for Single Sign-On (SSO) with the steps outlined
below.

• Create a new Domino SAML service provider (SP)


• Configure alias and determine redirect URI
• Create a SAML endpoint in your upstream identity provider
– NameID policy format
– Assertion attributes
– Additional requirements
• Import metadata and complete configuration in Keycloak
– Import IdP Metadata
– Additional settings
– Exporting metadata from Keycloak
• Configure attribute mappers
– Mapping first name, last name, and email
– Mapping username
– Attribute mapping documentation
– Troubleshooting attribute mapping
• Domino First Broker Login authentication flow
• Restrict access for SSO users to Domino
– Prerequisites
– Expected SAML attributes
– Attribute mapper
– Modifying the First Broker Login authentication flow
• Customizing the SSO button
• Testing and troubleshooting
• Session and token timeouts


Create a new Domino SAML service provider (SP)

1. Log in to the Keycloak UI as the default keycloak administrator user.


2. Click Identity Provider from the main menu.
3. Use the drop-down menu to create a new SAML 2.0 provider

Configure alias and determine redirect URI

Provide an Alias for the newly created provider. This is a unique name for the provider in Keycloak, and it will
also be part of the Redirect URI used by the provider service to route SAML responses and redirect users following
authentication.
The Redirect URI (case sensitive) will be:
https://<deployment_domain>/auth/realms/DominoRealm/broker/<alias>/endpoint
In an example deployment with domain domino.acme.org and provider alias domino-credentials, the URL will be:
https://domino.acme.org/auth/realms/DominoRealm/broker/domino-credentials/endpoint


Do not save the identity provider entry yet, as you will not be able to import your provider settings once it is saved.

Create a SAML endpoint in your upstream identity provider

To complete the configuration, you need to create a SAML application in the identity provider that will be integrated
with Domino. To create the application you will need the Redirect URI from the step above.
The specific procedure for creating the SAML endpoint will depend on your identity provider. Domino can integrate
via SAML with Okta, Azure AD, Ping, and any other provider that implements SAML v2.0.
The following are important properties of the SAML endpoint you will create in the provider. After the SAML
endpoint has been created and configured, you should export an XML metadata file you can use to complete the
configuration of the provider in Keycloak.

NameID policy format

Controls the format of the <saml2:NameID> element in the SAML Response. This will be used to derive the SSO
username in Domino.
• Option 1: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
– Users will be uniquely identified by their email and username will be automatically derived from
it
– Example:


<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
    john.smith@acme.org
</saml2:NameID>

• Option 2: urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified
– The SAML endpoint will need to respond with a string that can be used as the username of the
user without any modification
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">
    jsmith
</saml2:NameID>

• Option 3: urn:oasis:names:tc:SAML:1.1:nameid-format:persistent
– Typically the SAML endpoint will return a NameID that is a GUID which is not suitable for a username
– If the endpoint must use this format, then an additional attribute containing username must be returned

Assertion attributes

Additional SAML attributes are required to automatically populate the Domino profile. Without these, on first login
the user will be prompted to complete the required elements of their user profile.
The required attributes are:
• First Name
• Last Name
• Email
• Username (if NameId is not email or does not represent user name)
No specific attribute names are expected as these can be mapped in Keycloak.

Additional requirements

• Assertions signed - SAML Responses should contain signed assertions


Import metadata and complete configuration in Keycloak

Import IdP Metadata

You can use the metadata file from the step above to complete configuration of the provider in Keycloak.
You can do this from the bottom of the identity provider configuration page. This is only available before the provider
is saved for the first time.

If you are importing from a file, make sure to click Import after selecting the file. After import, most of the provider
settings will be configured automatically. You can now save the configuration.

Additional settings

• Trust Email - Yes


Ensures that emails supplied from IdP are trusted even if Email Verification is enabled for DominoRealm.

• NameID Policy Format


This should have been configured on import, but verify that it matches the option configured on the external
endpoint.
• Want Assertions Signed - Yes
• Validate Signature - Yes
The corresponding signature field should already be populated based on the metadata you imported in the pre-
vious step

Additional options like Assertion Encryption and Request Signing are supported, but would require additional
configuration coordination between Keycloak and the endpoint in your identity provider.
For more detailed documentation of all supported SAML settings, see Keycloak SAML v2 Identity Providers


Exporting metadata from Keycloak

Once the provider in Keycloak is saved, an Export tab will appear that contains XML metadata for the provider that
can be used to automatically configure the external endpoint.
The metadata will also be available at:
https://<deployment domain>/auth/realms/DominoRealm/broker/<alias>/endpoint/descriptor
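If your identity provider administrator needs the metadata as a file, it can be downloaded directly; the domain and
alias below are taken from the earlier example:

$ curl -o domino-sp-metadata.xml https://domino.acme.org/auth/realms/DominoRealm/broker/domino-credentials/endpoint/descriptor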

Configure attribute mappers

In order to make the experience of new users signing in for the first time seamless, and not require them to complete
their profile on initial login, you need to make sure that several SAML attributes are being passed back in SAML
responses and that these are correctly mapped to Domino user attributes.
If the attributes are not properly mapped, upon first login users will be prompted to complete missing fields in their
profile.

Mapping first name, last name, and email

To map these values from the SAML assertion attributes to the user profile model, you need to configure an Attribute
Importer mapper from the Mappers tab.


• First Name mapper


– Name: First Name
– Mapper Type: Attribute Importer
– Attribute Name: Name attribute for the <saml2:Attribute> element containing the value to be
mapped to First Name
– Friendly Name: FriendlyName attribute (optionally available) for the <saml2:Attribute> element
containing the value to be mapped to First Name
– User Attribute Name: Must be firstName
• Last Name mapper
– Name: Last Name
– Mapper Type: Attribute Importer
– Attribute Name: Name attribute for element containing the value for Last Name
– Friendly Name: FriendlyName attribute for the <saml2:Attribute> element containing the
value for Last Name
– User Attribute Name: Must be lastName
• Email mapper
– Name: Email
– Mapper Type: Attribute Importer
– Attribute Name: Name attribute for element containing the value for Email
– Friendly Name: FriendlyName attribute for the <saml2:Attribute> element containing the
value for Email
– User Attribute Name: Must be email
The following example illustrates how to map First Name from an assertion with the following payload:

<saml2:Attribute Name="customSAMLFirstName" FriendlyName="FriendlyFirstName">
    <saml2:AttributeValue>John</saml2:AttributeValue>
</saml2:Attribute>

Can be mapped using:


Alternatively, it can be mapped using:

Mapping username

The mapper configuration for username depends on how the external endpoint is configured with respect to NameID
Policy options.
• Option 1: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
– Use Email Prefix as UserName Importer
– Example:

<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
john.smith@acme.org
</saml2:NameID>


Map as shown:

• Option 2: urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified
– No need to do an importer. The username will be mapped automatically to the NameID value
– Example:

<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">
    jsmith
</saml2:NameID>

• Option 3: urn:oasis:names:tc:SAML:1.1:nameid-format:persistent
– Use Username Template Importer with Template of ${ATTRIBUTE.<attribute Name>}
or ${ATTRIBUTE.<attribute FriendlyName>}
– Example:

<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">
jsmith
</saml2:NameID>
<saml2:Attribute Name="customUserName">
<saml2:AttributeValue>jsmith</saml2:AttributeValue>
</saml2:Attribute>

Map as shown:


Attribute mapping documentation

For additional information on attribute mapping, refer to the Keycloak documentation for Mapping Claims and
Assertions.

Troubleshooting attribute mapping

When troubleshooting SAML attribute mapping, ideally you will want to have a specification for the SAML response
that your identity provider endpoint will send back to Keycloak following authentication. A thorough specification
will detail the NameID policy format and the attributes being sent in the response.
If such a specification is not available, or the attribute mapping does not function as expected, it may be necessary
to examine an actual SAML response that is returned after a login attempt. One simple way to do this is to use the
SAML-tracer extension available for Chrome and Firefox. It will allow you to examine decoded SAML requests and
responses. By examining a SAML response, you will be able to see the attributes that are being returned and verify
whether attributes are missing or the names or formats are different from what is expected.

Domino First Broker Login authentication flow

To configure the recommended login authentication flow, select the Domino First Broker Login flow (no
dashes):


See the Keycloak documentation for information on authentication flows.

Restrict access for SSO users to Domino

Typically, when configuring the SAML endpoint that will provide SSO authentication for Domino, the provider
administrator restricts the endpoint to a subset of users who should be allowed to authenticate through it. This is the
preferred method for restricting access to a subset of users with valid enterprise credentials.
In rare cases, where limitations in the provider software don't allow you to constrain the set of users who can
authenticate against the endpoint, the provider will need to pass an additional SAML attribute which specifies whether
a user is allowed to access Domino. The value of that attribute will depend on a specific rule for each user. Usually, it
will be based on membership in a particular group in your identity provider.
The following should be used as a last resort if all identity provider restriction options are exhausted.

Prerequisites

The Domino Keycloak instance must have keycloak.profile.feature.scripts=enabled.


Expected SAML attributes

There must be an attribute that indicates whether a properly authenticated user should be allowed to log in to Domino.
• AttributeName:
– Suggested: rolesForDomino
– Could be anything as this can be mapped
• Multi-valued: Yes
• Value:
– Contains one or more values that could be used for gating access. Typically would be roles or groups.

Attribute mapper

You need to add an additional mapper to your provider configuration in Keycloak.


Use an Attribute Importer mapper type.
• Name: Allow in Domino
• Mapper Type: Attribute Importer
• Attribute Name: Name attribute for element containing the flag
• Friendly Name: FriendlyName attribute for element containing the groups for the user
• User Attribute Name: Must be accessForDomino
Example:

<saml2:Attribute Name="rolesForDomino">
<saml2:AttributeValue>dave-users</saml2:AttributeValue>
<saml2:AttributeValue>it-users</saml2:AttributeValue>
</saml2:Attribute>

Modifying the First Broker Login authentication flow

Before modifying the default Domino First Broker Login flow, you should first make a copy of it.


In the copy, add an execution of type Script.

Move the new Script entry to be immediately after the Create User If Unique execution


Use the following script and modify the attribute value as needed.
/*
 * Template for JavaScript based authenticators.
 * See org.keycloak.authentication.authenticators.browser.ScriptBasedAuthenticatorFactory
 */

// import enum for error lookup
AuthenticationFlowError = Java.type("org.keycloak.authentication.AuthenticationFlowError");

/**
 * An example authenticate function.
 *
 * The following variables are available for convenience:
 * user - current user {@see org.keycloak.models.UserModel}
 * realm - current realm {@see org.keycloak.models.RealmModel}
 * session - current KeycloakSession {@see org.keycloak.models.KeycloakSession}
 * httpRequest - current HttpRequest {@see org.jboss.resteasy.spi.HttpRequest}
 * script - current script {@see org.keycloak.models.ScriptModel}
 * authenticationSession - current authentication session {@see org.keycloak.sessions.AuthenticationSessionModel}
 * LOG - current logger {@see org.jboss.logging.Logger}
 *
 * You can extract current http request headers via:
 * httpRequest.getHttpHeaders().getHeaderString("Forwarded")
 *
 * @param context {@see org.keycloak.authentication.AuthenticationFlowContext}
 */
function authenticate(context) {
    // name of the attribute that gates whether a user is allowed to authenticate in Domino,
    // and the value it must contain
    var requiredAttrName = "accessForDomino";
    var requiredAttrMustContain = "dave-users";
    var errorMessageId = "userNotAssignedToDominoInIdp";
    var errorPageTemplate = "error.ftl";

    if (user === null) {
        context.success();
        return;
    }

    LOG.info(script.name + " trace script auth for: " + user.username);

    var requiredAttrValues = user.getAttribute(requiredAttrName);

    LOG.info("User gated on attribute: " + requiredAttrName);
    LOG.info("Attribute values from SSO: " + requiredAttrValues);
    LOG.info("Attribute must contain: " + requiredAttrMustContain);

    if (requiredAttrValues === null ||
        requiredAttrValues.size() === 0 ||
        requiredAttrValues.contains(requiredAttrMustContain)) {
        // user is explicitly allowed in Domino
        LOG.info("User is allowed in Domino.");
        context.success();
        return;
    }

    // user is not authorized to access Domino
    LOG.info("User is not allowed in Domino.");
    context.failure(AuthenticationFlowError.IDENTITY_PROVIDER_DISABLED,
        context.form().setError(errorMessageId, null).createForm(errorPageTemplate));

    // actually remove the user that was provisionally created
    session.userLocalStorage().removeUser(realm, user);
}


Customizing the SSO button

When using the default domino-theme in Keycloak, each identity provider has a display text field that can be edited. This display text is shown on the SSO button for that identity provider. If the display text is blank or equal to the value of the Alias field, the button shows the default text Continue with Single Sign On. Any other value becomes the text on the button.

Testing and troubleshooting

If you encounter errors from the Keycloak service while attempting an SSO login, you can view the Keycloak request
logs via kubectl by running kubectl -n <domino-platform-namespace> logs keycloak-0.

Session and token timeouts

By default, sessions are limited to 60 days but can be configured differently as needed.
See the Keycloak documentation for more information on timeouts.


7.4.2 AWS credential propagation

Overview

If you have enabled SSO for Domino, you can optionally configure AWS credential propagation, which allows Domino to automatically assume temporary credentials for AWS roles based on the roles assigned to users in the upstream identity provider. Below is a reference for the overall workflow from user login to credential usage.

Validations within the AssumeRoleWithSAML workflow

1. The Identity Provider Relying Party/Application validates the Issuer element in the AuthnRequest (SAML request) sent by Domino.
2. Domino validates the Audience (Entity ID of the SP) in the SAML Response sent by the Identity Provider Relying Party/Application.


3. AWS AssumeRole validates that the Issuer of the SAML Response passed on from Domino matches the Issuer of the Identity Provider Relying Party/Application. You can also set up additional validations (for example, validating the Audience).

Launching a Workspace or Run

Enable credential propagation in Domino

The following central configuration settings need to be set as shown to enable credential propagation. These can be
found or added by a Domino administrator by clicking Advanced > Central Config from the administration UI.
• Key: com.cerebro.domino.auth.aws.sts.enabled
Value: true
• Key: com.cerebro.domino.auth.aws.sts.region
Value: Short AWS region name where your Domino is deployed, such as us-west-2
• Key: com.cerebro.domino.auth.aws.sts.defaultSessionDuration


Value: Default session duration, such as 1h for 1 hour


Example of a valid configuration:
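For instance, using the illustrative values from above (adjust the region and session duration for your deployment):
• com.cerebro.domino.auth.aws.sts.enabled: true
• com.cerebro.domino.auth.aws.sts.region: us-west-2
• com.cerebro.domino.auth.aws.sts.defaultSessionDuration: 1h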

Remember to restart the services with the link at the top of the central configuration page for these settings to take
effect.

SAML provider configuration prerequisites

You need to have federation between your AWS account and your identity provider configured independently of Domino. For an example, see AWS Federated Authentication with Active Directory Federation Services (AD FS).
The SAML provider application connected to Domino needs to include the appropriate AWS federation attributes based on the roles that each user will be allowed to assume.
Since Domino will refresh the user's credentials during an active session, you must ensure that any IAM role that you propagate to a user has an assume-self policy.
For example:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "<ARN for the role>"
}
}

Expected SAML attributes

• Attribute with Name https://aws.amazon.com/SAML/Attributes/Role


– Multi-valued: Yes
– Value format:

* Comma-separated key-value pair of provider and role


* <provider arn>,<role arn>


* arn:aws:iam::<acct #>:saml-provider/<provider name>,arn:aws:iam::<acct #>:role/<role name>
• Attribute with Name https://aws.amazon.com/SAML/Attributes/RoleSessionName
– Multi-valued: No
– Value:

* String to be used as identifier for the temporary credentials assumed


* Usually set to the email of the user
• Attribute with Name https://aws.amazon.com/SAML/Attributes/SessionDuration
– Multi-valued: No
– Value:

* Duration (in seconds) for which the initial set of credentials for each of the roles is valid before the user will need to log in again

* The duration must be smaller than the maximum allowable duration for each of the roles made available for a given user
Before proceeding, it's useful to check that your SAML attributes appear in your SAML response when logging into Domino. This will help validate that you've correctly established trust between AWS and your IdP. One simple way to do this is to use the SAML-tracer extension available for Chrome and Firefox. It allows you to examine decoded SAML requests and responses to see that the appropriate attributes appear.
Example:

<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
    <saml2:AttributeValue xsi:type="xs:string">
      arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role1
    </saml2:AttributeValue>
    <saml2:AttributeValue xsi:type="xs:string">
      arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role2
    </saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/RoleSessionName">
    <saml2:AttributeValue xsi:type="xs:string">
      john.smith@acme.org
    </saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/SessionDuration">
    <saml2:AttributeValue xsi:type="xs:string">
      900
    </saml2:AttributeValue>
  </saml2:Attribute>
</saml2:AttributeStatement>


Mapping AWS federation attributes

To map the appropriate values from the SAML assertion, you need to configure an Attribute Importer mapper from
the Mappers tab for the following attributes.

• AWS Roles
– Name: AWS Roles
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/Role
– Friendly Name: <blank>
– User Attribute Name: Must be aws-roles
• AWS Role Session Name
– Name: AWS Role Session Name
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/RoleSessionName
– Friendly Name: <blank>
– User Attribute Name: Must be aws-role-session-name
• AWS Session Duration
– Name: AWS Session Duration
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionDuration
– Friendly Name: <blank>
– User Attribute Name: Must be aws-session-duration


Additional provider configuration

In order to give Domino access to users' SAML assertions, you need to enable the following settings in the identity provider configuration:
• Store Tokens: On
• Store Tokens Readable: On

Domino-client configuration

The domino-play OIDC client is pre-populated on installation with client mappers, so that IdP mapped SAML infor-
mation will flow into Domino.
1. Go to the Clients tab in the DominoRealm and select the domino-play client


2. Select the Mappers tab for the domino-play client

3. Below are the default domino-play client mappers:


4. Create a new mapper with type User Session Note and the following settings:
• Name: identity-provider-mapper
• Mapper Type: User Session Note
• User Session Note: identity_provider
• Token Claim Name: idpbroker
• Claim JSON Type: string
• Add to ID token: On
• Add to access token: On


Usage

Once the configuration is complete for the first time, users will need to log out and log back in to Domino.
To confirm that credentials are propagating correctly to users, start a Workspace and check that the environment variable AWS_SHARED_CREDENTIALS_FILE is set and that your credential file appears at /var/lib/domino/home/.aws/credentials.
This should be sufficient for a user to connect to AWS resources without further configuration; a sketch of checking the credentials and connecting to S3 is shown below.
Learn more about using a credential file with AWS SDK.
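The following is a minimal sketch of verifying propagated credentials from a Workspace terminal. It assumes the AWS CLI is installed in the environment; the bucket and profile names are placeholders, not values Domino creates for you.

# Confirm that the propagated credentials file is present.
echo "$AWS_SHARED_CREDENTIALS_FILE"

# List the profile sections in the file (typically one per propagated role).
grep '^\[' "$AWS_SHARED_CREDENTIALS_FILE"

# Use one of the profiles to read from S3 (bucket and profile names are placeholders).
aws s3 ls s3://my-example-bucket --profile my-role-profile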

Confirming the configuration

To test your configuration outside of Domino, perform an AssumeRoleWithSAML call successfully using the SAML
token provided to Domino by your IdP.
Example:

aws sts assume-role-with-saml \
    --role-arn arn:aws:iam::521624712688:role/DataScientist-dev \
    --principal-arn arn:aws:iam::521624712688:saml-provider/ADFS-DOMINO \
    --saml-assertion "PHNhb.......VzcG9uc2U+"


7.4.3 SSO group and role synchronization

Domino supports synchronizing Domino administrative user roles and organization membership with attributes in your SAML identity provider. This allows management of these roles and memberships to be externalized to the identity provider.

SAML Group to Organization synchronization

Prerequisite

Your SAML provider application connected to Domino must include group membership as a multi-valued attribute.

Central configuration options

Enabling this feature requires that the following Domino central configuration setting is set as follows:
• Key: authentication.oidc.externalOrgsEnabled
Value: true
Remember that Domino services need to be restarted for this setting to take effect.


Attribute mapper

You need to add an additional mapper to the provider configuration in Keycloak.


Use an Attribute Importer mapper type.
• Name: Domino Groups
• Mapper Type: Attribute Importer
• Attribute Name: Name attribute for element containing the groups for the user
• Friendly Name: FriendlyName attribute for element containing the groups for the user
• User Attribute Name: Must be domino-groups
Example:

<saml2:Attribute Name="UserGroups">
<saml2:AttributeValue>nyc-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>all-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>sensitive-claims-users</saml2:AttributeValue>
</saml2:Attribute>

Domino client Mapper

By default, the domino-group-mapper client mapper is created upon installation. To review it, go to the Clients
tab in the DominoRealm in Keycloak, and select the domino-play client:


Select the Mappers tab for the domino-play client

The domino-group-mapper mapper will be present in the default client mappers listed:


Role synchronization

In addition to automatically configuring group membership, it is also possible to automatically assign Domino admin-
istrative and/or user roles to users based on attributes from your SAML identity provider.


Prerequisite

The SAML identity provider application connected to Domino must include attributes that can be mapped to specific
Domino roles.

Central configuration

Enabling this feature requires that the following Domino central configuration setting is set as follows:
• Key: authentication.oidc.externalRolesEnabled
Value: true
Remember that Domino services need to be restarted for this setting to take effect.

Attribute mapper

You need to add an additional mapper to the provider configuration in Keycloak.


Use an Attribute Importer mapper.
• Name: Domino System Roles
• Mapper Type: Attribute Importer
• Attribute Name: Name attribute for element containing the Domino system roles for the user
• Friendly Name: FriendlyName attribute for element containing the groups for the user
• User Attribute Name: Must be domino-system-roles

Domino client mapper

By default, the domino-system-roles client mapper is created upon installation. To review it, go to the Clients
tab in the DominoRealm in Keycloak and select the domino-play client.


Select the Mappers tab for the domino-play client

The domino-system-roles mapper will be present in the default client mappers listed:


7.4.4 Summary of Domino SAML attribute requirements

This section covers the SAML attributes expected by Domino to enable different pieces of functionality.


SSO attributes

The following are required to establish single sign-on between Domino and your identity provider:
• Username
– NameID (In Subject element)
– Preferred format: urn:oasis:names:tc:SAML:1.1:nameid-format:email
• First Name
– Attribute name: Can be any name since Domino allows attribute mapping
• Last Name
– Attribute name: Can be any name since Domino allows attribute mapping
• Email
– Attribute name: Can be any name since Domino allows attribute mapping
Example:

<saml2:Subject xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:email">
john.smith@acme.org
</saml2:NameID>
...
</saml2:Subject>
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoEmail">
<saml2:AttributeValue xsi:type="xs:string">
john.smith@acme.org
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="DominoFirstName">
<saml2:AttributeValue xsi:type="xs:string">
John
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="DominoLastName">
<saml2:AttributeValue xsi:type="xs:string">
Smith
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>


Credential propagation attributes

The following attributes are not needed for basic SSO, but are required if you are using the credential propagation functionality of Domino:
• AWS Roles
– Attribute Name: https://aws.amazon.com/SAML/Attributes/Role
– Multi-valued: Yes
– Value format:

* Comma-separated key-value pair of provider and role


* <provider arn>,<role arn>
* arn:aws:iam::<acct #>:saml-provider/<provider name>,arn:aws:iam::<acct #>:role/<role name>
• AWS Role Session Name
– Attribute Name: https://aws.amazon.com/SAML/Attributes/RoleSessionName
– Multi-valued: No
– Value:

* String to be used as identifier for the temporary credentials assumed


* Usually set to the email of the user
• AWS Session Duration
– Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionDuration
– Multi-valued: No
– Value:

* Duration (in seconds) for which the initial set of credentials for each of the roles is valid before the user will need to log in again

* The duration must be smaller than the maximum allowable duration for each of the roles made available for a given user
Example:

<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
    <saml2:AttributeValue xsi:type="xs:string">
      arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role1
    </saml2:AttributeValue>
    <saml2:AttributeValue xsi:type="xs:string">
      arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role2
    </saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/RoleSessionName">
    <saml2:AttributeValue xsi:type="xs:string">
      john.smith@acme.org
    </saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/SessionDuration">
    <saml2:AttributeValue xsi:type="xs:string">
      900
    </saml2:AttributeValue>
  </saml2:Attribute>
</saml2:AttributeStatement>

Group synchronization attributes

The following additional attributes are required if you are using the group synchronization functionality in Domino:
• Domino Organizations
– Name: Can be any name since Domino can do attribute mapping
– Multi-valued: Yes
– Values:

* One or more of the groups of which the user is a member in your centralized identity provider. For
any groups specified here, the user will be automatically enrolled in a Domino organization with the
same name
Example:

<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoOrganizations">
<saml2:AttributeValue>nyc-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>all-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>sensitive-claims-users</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>


Administrative role synchronization attributes

The following additional attributes are required if you are using the administrative role synchronization functionality in Domino:
• Domino System Roles
– Name: Can be any name since Domino can do attribute mapping
– Multi-valued: Yes
– Values:

* One or more values that are an exact, case-sensitive match to one of the Domino administrative roles
· Practitioner
· SysAdmin
· Librarian
· ReadOnlySupportStaff
· SupportStaff
· ProjectManager
Example:

<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoSystemRoles">
<saml2:AttributeValue xsi:type="xs:string">
SysAdmin
</saml2:AttributeValue>
<saml2:AttributeValue xsi:type="xs:string">
Librarian
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>



CHAPTER 8

Operations

This section contains information for IT and site reliability operations on how to measure, understand, and manage the
health of a deployed Domino application.
Domino runs in Kubernetes, which is an orchestration framework for containerized applications. In this model there
are three distinct layers with their own relevant metrics:
1. Domino application
This is the top layer, representing Domino application components running in containers that are deployed and
managed by Kubernetes. The content in this guide focuses on operations in this layer.
2. Kubernetes cluster
This is the Kubernetes software-defined hardware abstraction and orchestration system that manages the deploy-
ment and lifecycle of Domino application components. Cluster operations are handled a layer below Domino,
but do need to take into account the Domino architecture and cluster requirements. For guidance on general
cluster administration, consult the official Kubernetes documentation.
3. Host infrastructure
This is the bottom layer, representing the virtual or physical host machines that are doing work as nodes in the
Kubernetes cluster. Operations in this layer, including management of computing and storage resources as well
as OS patching, are the responsibility of the IT owners of the infrastructure. Domino does not have any unique
or unusual requirements in this layer.

8.1 Domino application logging

There are two types of logs produced by the operation of Domino.


1. Domino execution logs
2. Domino application logs


8.1.1 Execution logs

These are the logs output by user code running in Domino as a Job, Workspace, App, or Model API. These are
available in the Domino web application on the Jobs Dashboard, Workspaces Dashboard, App Dashboard, and Model
API instance logs. This data is a key part of the Domino reproducibility model, and is kept indefinitely in the Domino
blob store.
The system these logs are written to is defined in the installation configuration file at blob_storage.logs.

8.1.2 Application logs

All Domino services output their logs using the standard Kubernetes logging architecture. Relevant logs are printed to
stdout or stderr as indicated, and are captured by Kubernetes.
For example, to look at your front end logs you could do the following (some additional kubectl variations are shown after these steps):
1. List all namespaces to find the name of your platform namespace:
kubectl get namespace
2. List all the pods in your platform namespace to find the name of a front end pod. Keep in mind you likely have more than one front end pod:
kubectl get pods -n <platform namespace>
3. Print the logs for one of your front end pods:
kubectl logs <front end pod name> -n <platform namespace> -c nucleus-frontend
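As a supplement to these steps, the variations below use standard kubectl flags (not Domino-specific behavior); pod and namespace names are placeholders.

# Follow new log lines as they arrive.
kubectl logs <front end pod name> -n <platform namespace> -c nucleus-frontend -f

# Show only the last hour of logs.
kubectl logs <front end pod name> -n <platform namespace> -c nucleus-frontend --since=1h

# Show logs from the previous container instance after a restart.
kubectl logs <front end pod name> -n <platform namespace> -c nucleus-frontend --previous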
The most effective way to aggregate logs is to attach a Kubernetes log aggregation utility to monitor the following
Kubernetes namespaces used by Domino:
• Platform namespace
This namespace hosts the core application components of the Domino application, including API servers,
databases, and web interfaces. The name of this namespace is defined in the installer configuration file at
namespaces.platform.name.
The following components running in this namespace produce the most important logs:


• nucleus-frontend
The nucleus-frontend pods host the frontend API server that routes all requests to the Domino application. Its logs will contain details on HTTP requests to Domino from the application or another API client. If you see errors in Domino with HTTP error codes like 500, 504, or 401, you can find corresponding logs here.
• nucleus-dispatcher
The nucleus-dispatcher pod hosts the Domino scheduling and brokering service that sends user execution pods to Kubernetes for deployment. Errors in communication between Domino and Kubernetes will result in corresponding logs from this service.
• keycloak
The keycloak pods host the Domino authentication service. The logs for this service will contain a record of authentication events, including additional details on any errors.
• cluster-autoscaler
This pod hosts the open-source Kubernetes cluster autoscaler, which controls and manages autoscaling resources. The logs for this service will contain records of scaling events, both scaling up new nodes in response to demand and scaling down idle resources, including additional details on any errors.

• Compute grid namespace


This namespace hosts user executions plus Domino environment builds. The name of this namespace is defined
in the installer configuration file at namespaces.compute.name.
Logs that appear in this namespace will correspond to ephemeral pods hosting user work. Each pod will contain a user-defined environment container, whose logs are described above as Execution logs. There are additional supporting containers in those pods, and their logs may contain additional information on any errors or behavior seen with specific Domino executions.
Users are advised to aggregate and keep at least 30 days of logs to facilitate debugging. These logs can be harvested
with a variety of Kubernetes log aggregation utilities, including:
• Loggly
• Splunk
• NewRelic

8.2 Domino monitoring

Monitoring Domino involves tracking several key application metrics. These metrics reveal the health of the applica-
tion and can provide advance warning of any issues or failures of Domino components.


8.2.1 Metrics

Domino recommends tracking these metrics in priority order:

• Latency to /health
Suggested threshold: 1000ms
Measures the time to receive a response to a request to the Domino API server. If the response time is too high, this suggests that the system is unhealthy and that user experience might be impacted. This can be measured by calls to the Domino application at a path of /health.
• Dispatcher pod availability from metrics server
Suggested threshold: nucleus-dispatcher pods available = 0 for > 10 minutes
If the number of pods in the nucleus-dispatcher deployment is 0 for greater than 10 minutes, it's an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded.
• Frontend pod availability from metrics server
Suggested threshold: nucleus-frontend pods available < 2 for > 10 minutes
If the number of pods in the nucleus-frontend deployment is less than two for greater than 10 minutes, it's an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded.
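As a quick spot check of the first metric, the latency of the /health endpoint can be probed with curl from any host that can reach Domino; this is a minimal sketch, and the hostname shown is a placeholder for your deployment's URL.

# Print the HTTP status and total response time for the Domino health endpoint.
# Replace domino.example.com with your deployment's hostname.
curl -o /dev/null -s -w "HTTP %{http_code} in %{time_total}s\n" https://domino.example.com/health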

There are many application monitoring tools you can use to track these metrics, including:
• NewRelic
• Splunk
• Datadog

8.2.2 Alerting

Users are advised to configure alerts to their application administrators if the thresholds listed above are exceeded.
These alerts are an indication of potential resourcing issues or unusual usage patterns worth investigation. Refer to
the Domino application logs, the Domino administration UI, and the Domino Control Center to gather additional
information.

8.3 Sizing infrastructure for Domino

Domino runs in Kubernetes, which is an orchestration framework for delivering applications to a distributed compute
cluster. The Domino application runs two types of workloads in Kubernetes, and there are different principles to sizing
infrastructure for each:


• Domino Platform
These always-on components provide user interfaces, the Domino API server, orchestration, metadata and sup-
porting services. The standard architecture runs the platform on a stable set of three nodes for high availability,
and the capabilities of the platform are principally managed through vertical scaling, which means changing
the CPU and memory resources available on those platform nodes and changing the resources requested by the
platform components.
• Domino Compute
These on-demand components run users’ data science, engineering, and machine learning workflows. Compute
workloads run on customizable collections of nodes organized into node pools. The number of these nodes can
be variable and elastic, and the capabilities are principally managed through horizontal scaling, which means
changing the number of nodes. However, when there are more resources present on compute nodes, they can
handle additional workloads, and therefore there are benefits to vertical scaling.

8.3.1 Sizing the Domino Platform

The resources available to the Domino Platform will determine how much concurrent work the application can handle.
This is the primary capability of Domino that is limited by vertical scale. To increase the capacity, key components
must have access to additional CPU and memory.
The default size for the Domino Platform is three nodes, with 8 CPU cores and 32GB memory each, for a total of
24 CPU cores and 96GB of memory. Those resources are available to the collective of Platform services, and each
service claims some resources via Kubernetes resource requests.
The capabilities of that default size are shown below, along with options for alternative sizing.

• Default
Maximum concurrent executions: 300
Platform specs: 3 nodes with at least 8 CPU cores and 32 GiB memory each.
AWS recommendation: 3x m5.2xlarge
GCP recommendation: 3x n1-standard-8
Azure recommendation: 3x Standard_DS5_v2
• Other
Maximum concurrent executions: contact your Domino account team if you need an alternative size.
Platform specs: varies.
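To see how much of a platform node's allocatable capacity is already claimed by resource requests, standard kubectl commands can be used. This is a minimal sketch; the node-pool label shown is only an assumption about how your platform nodes are labeled, so adjust the selector (or omit it) to match your cluster.

# List platform nodes, then show requested vs. allocatable CPU and memory on each.
# The label selector is an assumption; list nodes without it if your labels differ.
kubectl get nodes -l dominodatalab.com/node-pool=platform
kubectl describe nodes -l dominodatalab.com/node-pool=platform | grep -A 5 "Allocated resources"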


Estimating concurrent executions

Domino recommends assuming a baseline maximum number of workloads equal to 50% of the total number of Domino users, expressed as a concurrency of 50%. For example, a deployment with 200 users sized at 50% concurrency should plan for roughly 100 concurrent executions. However, different teams and organizations may have different usage patterns in Domino. For teams that regularly run batches of many executions at once, it may be necessary to size Domino to support a concurrency of 100%, or even 200%.

Optimizing your configuration for efficient use of Platform resources

The following practices can maximize the capabilities of a Platform with a given size.
• Cache frequently used Domino environments in the AMI used for your Compute Nodes. This reduces load on
the Platform Docker registry.
• Optimize your hardware tiers and node sizes to fit many workloads in tidy groups. Each additional node runs
message brokers, logging agents, and adds load to Platform services that process queues from the Compute
Grid. The Platform can handle more concurrent executions by running more executions on fewer nodes.
• Parallelize your tasks by running your workload on many cores of one large node, rather than by chunking tasks
into multiple workloads across multiple nodes. This reduces the total number of nodes being managed, and
thereby reduces load on the Domino platform.



CHAPTER 9

Data management

9.1 Data in Domino

• Overview
• About Domino project files
– How is the data in project files stored?
– Who can access the data in project files?
• About Domino Datasets
– How is the data in Domino Datasets stored?
– Who can access the data in Domino Datasets?
• Integrating Domino with other data stores and databases
• Tracking and auditing data interactions in Domino

9.1.1 Overview

This article describes how Domino stores and handles data that users upload, import, or create in Domino. There are
two systems that store data in Domino:


• Domino project files


• Domino Datasets
Additionally, Domino supports connecting to many external data stores. Users can import data from external stores
into Domino, export data from Domino to external stores, or run code in Domino that reads and writes from external
stores without saving data in Domino itself.

9.1.2 About Domino project files

How is the data in project files stored?

Work in Domino happens in projects. Every Domino project has a corresponding collection of project files. While at
rest, project files are stored in a durable object storage system, referred to as the Domino Blob Store. This can be a
cloud service like Amazon S3, or it can be an on-premises Network Attached Storage (NAS) system.
When a user starts a Run in Domino, the files from his or her project are fetched from the Blob Store and loaded
into the Run in the working directory of the Domino service filesystem. When the Run finishes, or the user initiates
a manual sync in an interactive Workspace session, any changes to the contents of the working directory are written
back to Domino as a new revision of the project files. Domino’s versioning system tracks file-level changes and can
provide rich file difference information between revisions.
Domino also has several features that provide users with easy paths to quickly initiating a file sync. The following
events in Domino can trigger a file sync, and the subsequent creation of a new revision of a project’s files.
• User uploads files from the Domino web application upload interface
• User authors or edits a file in the Domino web application file editor
• User syncs their local files to Domino from the Domino Command Line Interface
• User uploads files to Domino via the Domino API
• User executes code in a Domino Job that writes files to the working directory
• User writes files to the working directory during an interactive Workspace session, and then initiates a manual
sync or chooses to commit those files when the session finishes
All revisions of project files that Domino creates are kept forever, since project files are a component in the Domino
Reproducibility Engine. It is always possible to return to and work with past revisions of project files.
While users are generally unable to permanently delete data from Domino project files, administrators do have the
capability to delete specific files by directly editing the contents of the blob store. This is an invasive process and not
recommended for day-to-day activity.


Who can access the data in project files?

Users can read and write files to the projects they create, on which they automatically are granted an Owner role.
Owners can add collaborators to their projects with the following additional roles and associated files permissions.
• Contributor
Can read and write project files.
• Results Consumer
Can read project files.
• Launcher User
Cannot access project files.
• Project Importer
Can access files made available for export.
The permissions available to each role are described in more detail in Sharing and collaboration.
Users can also inherit roles from membership in Domino Organizations. Learn more in the Organizations overview.
Domino users with administrative roles are granted additional access to project files across the Domino deployment
they administer. Learn more in Admin roles.

9.1.3 About Domino Datasets

How is the data in Domino Datasets stored?

When users have large quantities of data, including collections of many files and large individual files, Domino rec-
ommends storing the data in a Domino Dataset. Datasets are collections of Snapshots, where each Snapshot is an
immutable image of a filesystem directory from the time when the Snapshot was created.
These directories are stored in a network filesystem like Amazon EFS or a local NFS, and can be attached to Domino
Runs for read-only use without transferring their contents into the Domino service filesystem. This allows users to
quickly start working on big data in Domino.
Each Snapshot of a Domino Dataset is an independent state, and its membership in a Dataset is an organizational
convenience for working on, sharing, and permissioning related data. Domino supports running scheduled Jobs that
create Snapshots, enabling users to write or import data into a Dataset as part of an ongoing pipeline.
Unlike project files, Dataset Snapshots can be permanently deleted by Domino system administrators. Snapshot
deletion is designed as a two-step process to avoid data loss, where users mark Snapshots they believe can be deleted,
and admins then confirm the deletion if appropriate. This permanent deletion capability makes Datasets the right
choice for storing data in Domino that has regulatory requirements for expiration.


Who can access the data in Domino Datasets?

Datasets in Domino belong to projects, and access is afforded accordingly to users who have been granted roles on
the containing project. Owners can mount Snapshots from Datasets in the project for read access, they can write new
Snapshots, and they can add collaborators with the following roles.
• Contributor
Can mount Datasets for read access and write new Snapshots.
• Results Consumer
Cannot read from Datasets or write new Snapshots.
• Launcher User
Cannot read from Datasets or write new Snapshots.
• Project Importer
Can mount Datasets for read access.
The permissions available to each role are described in more detail in Sharing and collaboration.
Users can also inherit roles from membership in Domino Organizations. Learn more in the Organizations overview.
Domino users with administrative roles are granted additional access to Datasets across the Domino deployment they
administer. Learn more in Admin roles.

9.1.4 Integrating Domino with other data stores and databases

Domino can be configured to connect to external data stores and databases. This process involves loading the re-
quired client software and drivers for the external service into a Domino environment, and loading any credentials or
connection details into Domino environment variables. Users can then interact with the external service in their Runs.
Users can import data from the external service into their project files by writing the data to the working directory of
the Domino service filesystem, and they can write data from the external service to Dataset Snapshots. Alternatively, it
is possible to construct workflows in Domino that save no data to Domino itself, but instead pull data from an external
service, do work on the data, then push it to an external service.
Learn more in the Data sources overview and read our detailed Data source connection guides.


9.1.5 Tracking and auditing data interactions in Domino

Domino system administrators can set up audit logs for user activity in the platform. These logs record events whenever
users:
• Create files
• Edit files
• Upload files
• View files
• Sync file changes from a Run
• Mount Dataset Snapshots
• Write Dataset Snapshots
This list is not exhaustive, and will expand as Domino adds new features and capabilities.
Domino administrators can contact support@dominodatalab.com for assistance enabling, accessing, and processing these logs.

9.2 Data flow in Domino

There are three ways for data to flow in and out of a Domino Run.

9.2.1 1) Domino File Store

Each Domino Run takes place in a project, and the files for the active revision of the project are automatically loaded
into the local execution volume for a Job or Workspace according to the specifications of the Domino Service Filesys-
tem. These files are retrieved from the Domino File Store, and any changes to these files are written back to the
Domino File Store as a new revision of the project’s files.

9.2.2 2) Domino Datasets

Domino Runs may optionally be configured to mount Domino Datasets for input or output. Datasets are network
volumes mounted in the execution environment. Mounting an input Dataset allows for a Job or Workspace to both
start quickly and have access to large quantities of data, since the data is not transferred to the local execution volume
until user code performs read operations from the mounted volume. Any data written to an output Dataset is saved by
Domino as a new snapshot.

9.2.3 3) External data systems

User code running in Domino can use third party drivers and packages to interact with any external databases, APIs,
and file systems that the Domino-hosting cluster can connect to. Users can read and write from these external systems,
and they can import data into Domino from such systems by saving files to their project or writing files to an output
Dataset.


The diagram below shows the series of operations that happens when a user starts a Job or Workspace in Domino, and
illustrates when and how various data systems can be used.

9.3 External Data Volumes

• Overview
• Setting up Kubernetes PV and PVC
• Registering external data volumes
• Viewing registered external data volume details
• Editing registered external data volumes
• Unregistering external data volumes
• Configuring censorship

9.3.1 Overview

You can access the External Data Volumes (EDV) administration screen by going to the Domino administration page
and navigating to External Data Volumes: Data -> External Volumes


External data volumes must be registered with Domino before they can be used. All registered external data volumes appear in a standard table, which displays the EDV name, type, description, and volume access (see Volume Properties). In addition, for each registered EDV, the Projects column indicates which projects have added the EDV.

Unless otherwise specified, all the following actions assume you are in the EDV administration page.

9.3.2 Setting up Kubernetes PV and PVC

Note: We assume the set up of Kubernetes persistent volumes (PV) and persistent volume claims (PVC) is done by a
Kubernetes administrator.

Domino runs on a Kubernetes cluster and EDVs must be backed by an underlying Kubernetes persistent volume (PV). More importantly, that persistent volume must be bound to a properly labelled persistent volume claim (PVC). Here is an example PV yaml file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 30Gi
  nfs:
    path: /mnt/export
    server: 10.0.0.26
  persistentVolumeReclaimPolicy: Retain

The creation of the PVC must include a label with the key dominodatalab.com/external-data-volume. The value of that key represents the type of external data volume. Currently, NFS is the only supported value. Finally, the PVC must be created in the Domino compute namespace. Here is an example PVC yaml file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nfs
  namespace: default    # replace with your Domino compute namespace
  labels:
    "dominodatalab.com/external-data-volume": "NFS"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  volumeName: pv-nfs

All properly labelled PVCs will be available candidates to register in the Domino EDV administration user interface.
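Assuming kubectl access to the cluster, the following sketch applies the example manifests and confirms that the PVC carries the required label so it will show up as a registration candidate; the file names and namespace are placeholders.

# Create the PV and the labelled PVC (file names are placeholders).
kubectl apply -f pv-nfs.yaml
kubectl apply -f pvc-nfs.yaml -n <domino compute namespace>

# Verify that the PVC is Bound and carries the required EDV label.
kubectl get pvc -n <domino compute namespace> -l dominodatalab.com/external-data-volume=NFS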

9.3.3 Registering external data volumes

To register an EDV with Domino, click the Register External Volume button on the upper right hand side of the EDV administration page. This will open a modal with the EDV registration wizard. The wizard guides administrators through registering the EDV by configuring various EDV properties (see Volume Properties).
1. Volume
The first step in the wizard is to select the volume type. Currently, NFS is the only supported volume type.
The Available Volumes list will show all candidate volumes of the selected type. The name of these volumes is the name of the backing Kubernetes persistent volume claim (PVC).


2. Configuration
The second step in the wizard is to configure the volume.


• Name. (Required). This field defaults to the name of the selected PVC, but can be changed. A good practice is to name the EDV such that users recognize it based on the supporting use case or some organization-defined convention.
• Mount Path. (Required). This specifies the relative mount path for the EDV for supported executions. This field defaults to the name of the selected PVC, but can be changed. This field must be unique across all registered EDVs. There are a few reserved words. See Volume Properties.
• Mount as read-only. This checkbox specifies whether the EDV is mounted as read-only or read-write. Default is read-only (checked). Note that this is enforced at the Domino layer. More restrictive access controls at the Kubernetes or NFS layer overrule this setting. For example, if the PVC access mode is set to read only, it does not matter that this field allows read-write; the underlying read-only permission will be enforced.
• Description. Admin-defined description for the EDV.
3. Access
The third step in the wizard is to define the volume access. See Volume Properties and Authorization.
• Everyone. Allow EDV access to all logged-in users.
• Specific users or organizations. Limit EDV access to specific users and organizations.


Note: Regardless of the setting here, Domino Administrators (SysAdmin) will always be able to access any external
data volume.

9.3.4 Viewing registered external data volume details

To view a registered EDV's details, click on the Name of the EDV in the admin table.


9.3.5 Editing registered external data volumes

To edit the details of a registered EDV, click on the vertical three dots on the right-hand side of its entry in the admin
EDV table. This will expose the Edit action. Click Edit to edit the EDV details.


A modal with editable fields appears where users can change EDV properties.


9.3.6 Unregistering external data volumes

To unregister an EDV, click on the vertical three dots on the right-hand side of its entry in the admin EDV table. This
will expose the Unregister action. Click Unregister to unregister the EDV.


A confirmation modal appears where users can confirm the unregistration by clicking Unregister, or cancel out of the
operation altogether by clicking Cancel.


9.3.7 Configuring censorship

Multiple users collaborating on the same project may not all have the same level of volume access. EDVs added to the project should not be accessible to users without volume access, and under no circumstances will a user without volume access to an EDV be able to mount that EDV in a supported execution. However, Domino offers options to manage the visibility of the EDV in the user interface with two levels of censorship. The levels of censorship allow administrators to choose between security and discoverability needs.
• Full censorship. Only the existence of any inaccessible EDV is made known to the user; the quantity and any metadata (such as name or description) is not made known to the user. This is the level for those who want the highest level of security.
• Inactive censorship. Inaccessible EDVs are made known to the user; the EDV metadata (such as name and description) is made known to the user. This is the level that promotes discoverability. With discoverability, users can escalate to Domino administrators to gain volume access. This is the default level of censorship.
The level of censorship is configured by a feature flag: ShortLived.ExternalDataVolumesFullCensor.

Key: ShortLived.ExternalDataVolumesFullCensor
Value: boolean
Default: false


When this is true, the censorship level is full censorship.


When this is false, the censorship level is inactive censorship.

9.4 Datasets administration

• Overview
• Accessing the Datasets administration interface
• Monitoring Datasets usage
• Setting limits on Datasets usage
• Deleting Snapshots from Datasets

9.4.1 Overview

Domino administrators have four important responsibilities when managing Domino Datasets:
1. periodically check the Datasets administration interface
2. monitor and track storage consumption
3. set limits on usage per-Dataset
4. handle deletion of Dataset snapshots


9.4.2 Accessing the Datasets administration interface

To access the Datasets administration interface, click Admin from the Domino main menu to open the Admin home,
then click Advanced > Datasets.

9.4.3 Monitoring Datasets usage

The Datasets administration page shows important information about Datasets usage in your deployment. At the top
of the interface is a display that shows:
• total storage size used by all stored Snapshots
• the size of all storage used by Snapshots marked for deletion
Below that display is a table of all Snapshots from the history of the deployment. This table can be sorted by Snapshot
status, size, and the name of the containing Dataset.


9.4.4 Setting limits on Datasets usage

There are two important central configuration options administrators can use to limit the growth of storage consump-
tion by Datasets.


Namespace: common
Key: com.cerebro.domino.dataset.quota.maxActiveSnapshotsPerDataset
Value: number
Default: 20
This option controls the maximum number of active Snapshots that may
be stored in a Dataset. Snapshots marked for deletion are not active
and do not count against this limit.

Namespace: common
Key: com.cerebro.domino.dataset.quota.maxStoredSnapshotsPerDataset
Value: number
Default: 20
This option controls the total number of Snapshots of any status that
may be stored in a Dataset.

If a Dataset reaches one of these limits, attempting to start a run with a Dataset configuration that could output a new
Snapshot will result in an error message. Before additional Snapshots can be written, you will need to delete old
snapshots or increase the limit.
Administrators can authorize individual projects to ignore these limits with an option in the Hardware & environment
tab of the project settings.

9.4.5 Deleting Snapshots from Datasets

Administrators can delete individual Snapshots at any time with the Delete button at the end of the row representing
the Snapshot in the Datasets administration UI. Clicking this button will open a confirmation dialog, and if you choose
to confirm, the Snapshot will be permanently deleted.


To avoid losing user data, Domino recommends following a two-step process for Snapshot deletion, where the user
who owns the Dataset marks a Snapshot for deletion, and then an administrator takes action to delete the Snapshot if
reasonable. Non-administrator users can never permanently delete Snapshots on their own.
From the Datasets administration UI, you’ll find a button you can click to Delete all marked snapshots, and you can
also sort the table of Snapshots by status to find and examine all Snapshots that have been marked for deletion.

9.5 Submitting GDPR requests

Domino Data Lab is able to assist its customers in their obligations as data controllers under GDPR. This article covers
how to submit requests to Domino, and the information required from the customer to action the request. Because
Domino does not systematically access the contents of files uploaded, requests either need to reference specific users,
or specific files.
1. User Deletion Domino is able to purge personal data about the name, email address and IP address of users of
Domino if required. To process a request, you will need to provide:
• The user account name
• A substitute user to inherit any owned projects and files
2. File versions request Domino is able to provide the hash of all files in a version chain, and optionally access to
those files as well. To process a request, you will need to provide:
• A text file with the username, project name, and file path in this format: username/file_path_1/file_name.csv.
3. List of projects referencing a file Domino is able to provide the list of all projects which reference a specific file
version in Domino. This is useful to identify potential impacts of changing a source file or version. To process
a request, you will need to provide:
• A text file with the username, project name, and file path in this format: username/file_path_1/file_name.csv.
4. File deletion or substitution request To process a request, you will need to have provided:
• A text file with the username, project name, and file path needing deletion or substitution in this format:
username/file_path_1/file_name.csv.


• A text file with the username, project name, and file path to substitute (if applicable) in this format: username/file_path_1/file_name.csv.
After any GDPR request, Domino will provide the customer evidence of the actions carried out but will no longer be
able to see the data in the system.
Most customers don’t encounter a need to have data deleted or returned through the course of doing business. Should
you need to for GDPR or other reasons, note that this may impact your history for reproducibility or auditability.
Domino does not accept any responsibility for identifying derived data from files, nor ensuring the stability of projects
or other work referencing a user or file altered in a request.



CHAPTER 10

User management

10.1 Roles

Administrators of Domino can assign roles to users. These roles can be set manually via the UI, or they can be mapped in from your identity provider if you have SSO integration enabled.
The available roles are:
• Practitioner
• SysAdmin
• ProjectManager
• Librarian
• SupportStaff
• ReadOnlySupportStaff
Users with no role are treated as a LightUser, have restricted feature access, and may have a different licensing status.
A SysAdmin user can grant access roles to other users. To do so, open the Users tab of the admin UI. Locate the user you want to grant permissions to, click Edit next to the username, then select the desired role.
Users can have more than one role, and will have the additive permissions of each role.
By default, all new users will be assigned the Practitioner role, but this can be changed with central configuration
options.


10.1.1 Project Overview Actions

Permission LightUser Practitioner SysAdmin


Create Project X
View Project List X X X
Fork Project X

10.1.2 File Actions

Permission LightUser Practitioner SysAdmin


List and View Files X X X
Edit Files X
Upload Files X

10.1.3 Workspace Actions

Permission LightUser Practitioner SysAdmin


Start Workspace X
Stop Workspace X X
Open Workspace X
View Workspace History X X X
Archive Workspace X X

10.1.4 Job Actions

Permission LightUser Practitioner SysAdmin


Start Job X
Stop Job X X
View Job History X X X
Create Scheduled Job X
Edit Scheduled Job X X
Delete Scheduled Job X X


10.1.5 Project Settings Actions

Permission LightUser Practitioner SysAdmin


View Project Settings X X
Edit Project Settings X X

10.1.6 Model API Actions

Permission LightUser Practitioner SysAdmin


Create Model API X X
Be a Model API “Owner” X X
Be a Model API “Editor” X X X
Be a Model API “Viewer” X X

10.1.7 App Actions

Permission LightUser Practitioner SysAdmin


Publish or Start App X X
Stop App X X X
View App X X X

10.1.8 Launcher Actions

Permission LightUser Practitioner SysAdmin


View Launchers X X X
Create or Edit Launcher X X
Delete Launcher X X
Run Launcher X X


10.1.9 Dataset Actions

Permission LightUser Practitioner SysAdmin


Create Dataset X
Create Dataset Snapshot X
Mount Dataset X
View Datasets X X X
Delete Dataset Snapshot X

10.1.10 Environment Actions

Permission LightUser Practitioner SysAdmin


List and View Environment X X X
Create Environment X X X
Edit Environment X X X

10.1.11 Administrator Actions

Permission LightUser Practitioner SysAdmin


View Admin UI X
Edit Settings in Admin UI X

About the Project Manager Role

When Project Managers are members of organizations, their role grants them owner-level access to all projects that
are owned by other members of the organizations. This allows the Project Manager to see these projects and their
assets in the Projects Portfolio and Assets Portfolio.
Note that the Project Manager may also have the ability to add users to these organizations, thereby gaining contributor
access to those users’ projects. For this reason, Project Manager should be treated as a highly privileged role, similar
to System Administrator.


10.2 License usage reporting

• Overview
• Tracking user license types
• Generating user activity reports

10.2.1 Overview

Administrators can use configurable thresholds to track user behavior across the platform for the purposes of identifying
users who are taking up a Domino license. Users who access Domino only to consume data science products,
view results, and run Launchers are not counted as taking up a practitioner license.
Once a user performs a data science workflow like starting a Run or publishing a Model, the user will be considered a
practitioner for the purposes of licensing.

10.2.2 Tracking user license types

To view user information and identify users who are taking up a license, open the Admin interface by clicking Admin
at the bottom of the main menu, then click Users.


From this interface, admins can see:


• What license type a user is assigned
• How many data science practitioner workloads a user has run
• The most recent activity for a user
This allows admins to identify inactive users who are taking up a practitioner license, and there is an option in this
interface to free up the license by deactivating the user.

10.2.3 Generating user activity reports

The same data on license types, practitioner workloads, and recent activity that is shown in the Users table is available
as a downloadable CSV report. To generate a report manually, from the Admin interface click Advanced > User
Activity Report.

Admins can specify the following parameters for the report:


• A date range they want data from
• How far back to set the threshold for including actions in the recent activity section
• A specific project or organization to get data about


• Email addresses to receive copies of the report


It’s also possible to configure Domino to send User Activity Reports on a regular cadence. To set this up, click
Advanced > Central Config from the Admin interface, then set the following options.

Namespace: common
Key: com.cerebro.domino.Usage.ReportRecipients
Value: comma-separated list of email addresses to receive automated reports
Default: empty

Namespace: common
Key: com.cerebro.domino.Usage.RecentUsageDays
Value: number of days back to set as the threshold for recent activity
Default: 30

Namespace: common
Key: com.cerebro.domino.Usage.ReportFrequency
Value: cron string for how often to send usage reports
Default: 0 0 2 * * ? (daily at 02:00)
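
For illustration, the records below would keep the default daily 02:00 schedule, use a 60-day recent-activity window, and email the report to two recipients. The email addresses are placeholders; substitute your own.

Namespace: common
Key: com.cerebro.domino.Usage.ReportRecipients
Value: admin@example.com,it-ops@example.com

Namespace: common
Key: com.cerebro.domino.Usage.RecentUsageDays
Value: 60

Namespace: common
Key: com.cerebro.domino.Usage.ReportFrequency
Value: 0 0 2 * * ?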




CHAPTER 11

Environments

11.1 Environment management best practices

• Overview
• Best practices
• How to clean up your catalog of environments
– Look at current environment usage
– Plan changes to global environments
– Sunset deprecated and unused environments

11.1.1 Overview

This document covers best practices for compute environment management. As a Domino admin, you will have the
power and responsibility to curate the environments used by your organization. A proactive approach to environment
management can prevent sprawl, avoid repetition in environment creation, and equip users with the tools they need to
succeed in Domino.
As an admin, your objective is to find a balance between giving users the freedom to be agile in development and
maintaining enough control that you don’t end up with duplicate or unnecessary environments. Admins and users


are able to create an arbitrary number of environments in Domino. You’ll want to manage their creation so that you don’t
end up with dozens of global environments and hundreds of user environments, which can make it hard for users to
know which environments to use, and hard for you as an admin to maintain.
Don’t let your Domino look like this:

11.1.2 Best practices

Domino recommends admins follow these best practices:


1. Try to keep as few global environments as possible.
In our experience, the benefit of focusing on a small number of global environments with broad applications
outweighs the benefit of creating many niche global environments. When a user requests an environment with
new capabilities, consider whether you can add their requested features to an existing global environment instead
of creating a new one.
2. Use clear, descriptive names for your environments.
A common cause of environment duplication is a user who cannot tell that an existing environment meets his
or her needs. Clear, descriptive names on all environments will make the whole catalog comprehensible to new
users or admins, and make Domino easier to work with and maintain.


3. Add comments to the Dockerfile for each environment.


You can add comments to a Dockerfile like this:

# This is a comment that could provide a helpful description of the code to follow
RUN echo 'This is an executed Dockerfile instruction'

# Here's a block that installs Ruby


RUN \
apt-get update && \
apt-get install -y ruby

Do yourself and future colleagues a favor by investing in a well-commented Dockerfile. Each section should have a
clear heading and comments that explain its purpose and implementation.
4. Share responsibility for environment management
If you have multiple teams or departments doing separate work in Domino, they should be responsible for
maintaining their own team-specific environments. Find an advanced user in each team and make him or her
a deputy for environment management. This person should be responsible for planning and understanding the
environments his or her team needs, and should work with you on implementation. This reduces the workload
of the admins, and ensures that environments are designed by someone with context on what users need.
5. Keep global images up-to-date and comprehensive
You should strive to have global images that cover the majority of users’ needs. Users should only need to make
minor additions to global environments when creating their own user environments, such as installing a specific
version of a package. You don’t want a situation where users are re-installing Python or making other major
changes, as this will result in a bloated and poorly performing environment.
6. Avoid time-consuming image pulls by caching global environments on the executor machine image
You should cache your global environments in your executor template machine image. This ensures that each
new executor starts up with the base Docker image for any environment already cached. If users are setting up
environments that have base images very different from what is cached on the machine image, it can lead to long
pull times when launching executors. Contact Domino Support for help with modifying your machine image.
7. Clean up old or poorly maintained environments
Create a culture of tidiness around environment creation and management. Enforce a standard of quality in nam-
ing and Dockerfile commenting, and be assertive about pruning unnecessary environments. See the following
section of this document for a walkthrough.

11.1.3 How to clean up your catalog of environments

Over time, it’s inevitable that the number of environments in your Domino will grow. It’s valuable to do an occasional
review to weed out unused environments, update active ones, and consolidate where possible. Depending on the size
of your organization and your use of Domino, this may be a yearly or quarterly task.


Look at current environment usage

As an admin, you can see all environments being used across your deployment in the Environments Overview. The
table of environments on this screen can be sorted by the # of Projects column to get a quick understanding of
which environments are in common use. You can also enter global in the search box to filter for global environments.

Click the name of an environment to see Dockerfile details, as well as the list of projects and models using that
environment. The list will also include the date of the last run for each project and model. Keep an eye out for
user environments that make duplicate changes to global base environments, as well as unused or poorly-maintained
environments.

Plan changes to global environments

Based on what you learned by reviewing existing environments, you should plan an updated set of global environments
that include the tools and features frequently added by users. In some cases, it might be as simple as adding a few
packages to an existing global environment. You can also create a new global environment when necessary, but we
recommend erring on the side of larger, more consolidated environments. Doing so will make it easier for your users
to choose an environment, and it will be easier for you to manage and maintain the collection of environments in your
deployment.
Before executing your plan and changing the available global environments, it’s best to inform your users of the
impending changes and solicit their feedback. Explain the changes to existing environments, announce the creation of
new ones, and provide recommendations for which environment to use for various types of projects.

Sunset deprecated and unused environments

As an admin, you have the power to archive any environment. All old projects will still be able to use an archived
environment, but new projects won’t be able to select it. Historical runs will still reference an archived environment,
so archiving never breaks reproducibility. Use archiving to encourage adoption of new, up-to-date, and consolidated
environments. Environments can be un-archived at any time.


11.2 Caching environment images in EKS

When a user launches a Domino Run, part of the start-up process is loading the user’s environment onto the node that
will host the Run. For large images, the process of transferring the image to a new node can take several minutes.
After an image has been loaded onto a node once, it is cached, and future Runs that use the same environment will
start up faster.
When running Domino on EKS, you can pre-cache popular environments and base images on the Amazon Machine
Image (AMI) used for new nodes. This can speed up the start time of Runs on new nodes significantly. This page
describes the process of creating a new AMI with cached environments and configuring EKS to use it for new nodes.

11.2.1 AMI requirements

In addition to any dependencies required by Kubernetes itself, your AMI should contain the following:
• Docker
• A cache of Domino’s compute environments
• NVIDIA Docker 2 (GPU nodes only)
• NVIDIA GPU driver 410+ (GPU nodes only)
• The NVIDIA runtime set as the default Docker runtime (GPU nodes only)
For simplicity, Domino recommends that you use the official EKS default AMIs, which come pre-configured with Docker and
the GPU tools.
• Click to read about the official EKS AMI Domino recommends for default compute nodes
• Click to read about the official EKS AMI Domino recommends for GPU nodes
Alternatively, you can use Amazon’s build scripts to create your own AMI for use with EKS.

11.2.2 AMI operations

The following sections describe how to perform several important types of operations on an EC2 instance to set it up
as the template for a new AMI suitable for Domino.


Install Docker

Read the official instructions on how to install Docker.

Pull environment images

Pre-caching environment images is a simple process of running docker pull for the base images those environments
are built on, or for the built environments from the internal registry itself.
To pull the Domino Standard Environment base images, your command would look like this, substituting in the version
string for the image you want to cache.

docker pull quay.io/domino/base:<desired version>

To pull a built image from the Domino internal registry, you will need to find its URI from the Revisions tab in the
environment details page.

For example, to cache revision #9 of such an environment, you would run a command like:

docker pull 100.97.56.113:5000/domino-5d7abf2715f3690007f23081:9

Install NVIDIA Docker 2.0 (GPU AMIs only)

Read the official instructions for installing the nvidia-docker 2.0 runtime.


Install GPU drivers (GPU AMIs only)

To use the GPU on a GPU node, you need to install the appropriate driver on the machine image. Domino does not
have a requirement for any specific driver version; however, if you want to use a Domino Standard Environment, it
should be a version that is compatible with the CUDA version shown in the standard environments.
Click to view a compatibility matrix.
If you’d like to install the GPU drivers manually, you can follow these instructions.
To validate that your GPU machine is configured properly, reboot the machine and run the following:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

This will show the driver number and GPU devices if installed successfully.

Change the default Docker runtime (GPU AMIs only)

Read the official instructions from NVIDIA on using the container runtime.
Note that you must restart Docker before this will work.
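
One common way to do this, shown here only as a hedged sketch, is to set nvidia as the default runtime in /etc/docker/daemon.json and then restart the Docker daemon. Confirm the exact settings against the NVIDIA instructions linked above before baking them into your AMI.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

# Restart Docker so the new default runtime takes effect
sudo systemctl restart docker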

11.2.3 Complete AMI caching procedure

1. Determine which AMI you want to use as the base for the new AMI. If you’re performing this operation on an
operational Domino node pool, you should use the AMI that’s currently used in the active launch configuration.

Once you’ve identified the name of the active launch configuration, view its details to see the AMI ID it uses.


2. Launch a new EC2 instance from the base AMI.


3. Connect to the instance via SSH and perform any of the operations listed above that you want to apply to your
new AMI, including pulling any environment images you want to cache.
4. Create a new AMI from the EC2 instance.
5. Create a copy of the launch configuration currently used by any ASGs you want to switch to the new AMI.
6. Edit the copied launch configuration so its AMI is the ID of the new AMI you created.
7. For any ASGs that you want to start using the new AMI, switch them over to the new launch configuration.
Once you complete the final step, any ASGs you switched to using the new launch configuration will start using the
new AMI whenever they create new nodes. These new nodes will therefore have any environment images you pulled
onto the AMI template cached, and will be fast to start new Domino Runs.
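
The final steps can also be scripted with the AWS CLI. The commands below are a hedged sketch: the instance ID, AMI name, launch configuration names, ASG name, and instance type are placeholders, and the copied launch configuration must carry over any other settings (security groups, key pair, IAM instance profile, user data) from the original.

# Step 4: create a new AMI from the prepared EC2 instance
aws ec2 create-image --instance-id i-0123456789abcdef0 --name domino-compute-cached-2021-03

# Steps 5-6: create a copy of the launch configuration that points at the new AMI
aws autoscaling create-launch-configuration \
    --launch-configuration-name domino-compute-cached-lc \
    --image-id ami-0abc123def4567890 \
    --instance-type m5.2xlarge

# Step 7: switch the ASG to the new launch configuration
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name domino-compute-asg \
    --launch-configuration-name domino-compute-cached-lc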



CHAPTER 12

Disaster recovery

12.1 Backing up Domino

• Domino backups in AWS


• Domino backups on-premises

12.1.1 Domino backups in AWS

The following systems are canonical stores of critical Domino data, and they are stored and backed up in AWS as
described below.


System: Projects store
Purpose: Stores the contents of users’ project files.
Store: Stored in S3 in the bucket specified in the installer configuration at blob_storage.projects.s3.bucket.
Backup: Relies on the inherent durability and automated replication and backups of S3.

System: Log history
Purpose: Stores execution logs from Domino jobs and workspaces.
Store: Stored in S3 in the bucket specified in the installer configuration at blob_storage.logs.s3.bucket.
Backup: Relies on the inherent durability and automated replication and backups of S3.

System: Docker registry
Purpose: Stores Docker images built by Domino to back users’ Domino environments and Model APIs.
Store: Stored in S3 in the bucket specified in the installer configuration at internal_docker_registry.s3_override.bucket.
Backup: Relies on the inherent durability and automated replication and backups of S3.

System: Datasets
Purpose: Stores the contents of users’ Domino Datasets.
Store: Stored in EFS in the file system specified in the installer config at storage_classes.shared.efs.filesystem_id.
Backup: Relies on the inherent durability and automated replication and backups of EFS.

System: MongoDB
Purpose: Stores Domino application object data and metadata.
Store: Stored in EBS-backed Kubernetes persistent volumes attached to pods running the HA MongoDB service in the Domino cluster.
Backup: Domino performs a daily backup that writes an archive containing all MongoDB data to S3 in the bucket specified in the installer configuration at blob_storage.backups.s3.bucket.

System: Domino Git
Purpose: Stores Domino project version history.
Store: Stored in EBS-backed Kubernetes persistent volumes attached to the pod running the Domino Git service in the Domino cluster.
Backup: Domino performs a daily backup that writes an archive containing all Git data to S3 in the bucket specified in the installer configuration at blob_storage.backups.s3.bucket.

System: Postgres
Purpose: Stores information on users managed by the Keycloak authentication service.
Store: Stored in EBS-backed Kubernetes persistent volumes attached to pods running the HA Postgres service in the Domino cluster.
Backup: Domino performs a daily backup that writes an archive containing all Postgres data to S3 in the bucket specified in the installer configuration at blob_storage.backups.s3.bucket.
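
To spot-check that the daily database archives are being written, you can list the configured backups bucket with the AWS CLI; the bucket name below is a placeholder for the value you set at blob_storage.backups.s3.bucket.

aws s3 ls s3://<your-backups-bucket>/ --recursive --human-readable | tail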

12.1.2 Domino backups on-premises

The following systems are canonical stores of critical Domino data, and they are stored and backed up on-premises as
described below. These methods can also be applied to other clouds for which Domino does not have native storage
integrations.


System: Projects store
Purpose: Stores the contents of users’ project files.
Store: Stored in the shared storage class defined at storage_classes.shared and referenced at blob_storage.projects. The most common backing storage for this class is NFS.
Backup: Domino does not automatically back up this data. Domino recommends configuring a separate backup process that saves or syncs the entire contents of the NFS mount path provided to Domino to another storage system. The volume of project data will grow linearly over time as users produce new projects, so you should account for the growing volume when setting backup frequency and retention policies.

System: Log history
Purpose: Stores execution logs from Domino jobs and workspaces.
Store: Stored in the shared storage class defined at storage_classes.shared and referenced at blob_storage.logs.
Backup: This is stored in the same volume as project data, and is therefore on the same filesystem path in the shared storage system. Any backups of the root path provided to Domino will include log data.

System: Docker registry
Purpose: Stores Docker images built by Domino to back users’ Domino environments and Model APIs.
Store: Stored in a block storage volume provisioned from the storage class defined at storage_classes.block and mounted in the Docker registry container.
Backup: Domino does not automatically back up this data. Domino recommends configuring a separate backup process that saves or syncs the contents of this volume to another storage system, and accounting for its growth over time when setting backup frequency and retention policies.

System: Datasets
Purpose: Stores the contents of users’ Domino Datasets.
Store: Stored in the shared storage class defined at storage_classes.shared.
Backup: This is stored in the same volume as project data, and is therefore on the same filesystem path in the shared storage system. Any backups of the root path provided to Domino will include this data.

System: MongoDB
Purpose: Stores Domino application object data and metadata.
Store: Stored in a block storage volume provisioned from the storage class defined at storage_classes.block and mounted in the MongoDB containers.
Backup: Domino performs a daily backup that writes an archive containing all MongoDB data to the shared storage class defined at storage_classes.shared and referenced at blob_storage.backups.

System: Domino Git
Purpose: Stores Domino project version history.
Store: Stored in a block storage volume provisioned from the storage class defined at storage_classes.block and mounted in the Git container.
Backup: Domino performs a daily backup that writes an archive containing all Git data to the shared storage class defined at storage_classes.shared and referenced at blob_storage.backups.

System: Postgres
Purpose: Stores information on users managed by the Keycloak authentication service.
Store: Stored in a block storage volume provisioned from the storage class defined at storage_classes.block and mounted in the Postgres containers.
Backup: Domino performs a daily backup that writes an archive containing all Postgres data to the shared storage class defined at storage_classes.shared and referenced at blob_storage.backups.
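
For the stores that Domino does not back up automatically, a scheduled copy job is often sufficient. The crontab entry below is a minimal sketch, assuming the shared NFS volume is mounted at /domino/shared and a backup target is mounted at /mnt/backup; both paths are placeholders for your own mount points.

# Nightly at 01:00, mirror the Domino shared storage to the backup target
0 1 * * * rsync -a --delete /domino/shared/ /mnt/backup/domino-shared/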



CHAPTER 13

Control Center

13.1 Control Center overview

• Overview
• Who can access the Control Center?
• How do I open the Control Center?
• What metrics are available in the Control Center?
• Drilling down for more details
• Control center hardware tier page
• Control center project page
• Control center user page

13.1.1 Overview

The Control Center displays important data about your Domino deployment. From the Control Center, you can view
deployment-wide usage of compute resources by hours of runtime or spend in USD. You can also drill down into


detailed statistics on projects, hardware tiers, and users. The Control Center data is also available for export if you’d
like to create your own reports or analyses.

13.1.2 Who can access the Control Center?

At this time, only Domino Admins can view the Control Center. The Control Center shows detailed deployment-wide
statistics and granular data on users and projects, and its functionality depends on the user having Admin permissions.
If you need access to the Control Center, contact your local Domino Administrator or email support@dominodatalab.com.


13.1.3 How do I open the Control Center?

If you have access to the Control Center, you’ll find a link to it in the Switch To menu.

13.1.4 What metrics are available in the Control Center?

When you first open the Control Center, you’ll see a bar chart of deployment compute spend in USD for each day in
the current month.

About compute spend


Compute spend is based on settings applied by admins when creating and managing hardware tiers. Compute spend
data will only be available in the Control Center if the “cents per minute” property is set on the hardware tier in use.
These numbers also only represent active usage, and do not reflect other potential spend like idle cloud resources or
storage.
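
For example, if a hardware tier is configured at 5 cents per minute, a workspace that runs on it for three hours registers 180 minutes × $0.05 = $9.00 of compute spend.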
You can change the date range shown with the dropdown menu in the upper right, and you can switch the chart to
display compute usage by hours of runtime with the dropdown in the upper left.
Below the deployment-wide chart is a panel that displays more granular data on projects, users, and hardware tiers
across the selected date range. You can chart these by the following metrics:
• Projects can be charted by compute spend (USD) or compute hours
• Users can be charted by compute spend (USD) or compute hours
• Hardware tiers can be charted by average run queue time in minutes

This chart will display the top five results for the chosen metric. When you have the chart set to display data on users,
there will also be a View all link you can use to load a paginated table with detailed usage statistics for all users.


By default this table will show data for the date range that was set on the previous page. There’s a dropdown menu in
the top right you can use to change the date range if desired.

13.1.5 Drilling down for more details

Many of the tables and charts support drilling down for more detail on a specific item. Click one of the bars in
a Control Center bar chart to see an expanded, detailed page on the related project, user, or hardware tier.
Some of these pages will also display a table of related runs. You can click an entry in a Run Logs table to view the
specified run in the project Runs UI.


13.1.6 Control center hardware tier page

This page shows performance averages for runs that use the specified hardware tier, and tracks completed runs. Details
on all runs performed on the specified hardware tier are listed in the Run Logs table. Click an entry in the table to
view the specified run in the project Runs UI.

13.1.7 Control center project page

This page breaks down project spend across Apps, Batch Runs, Endpoints, Launchers, Scheduled Runs, and
Workspaces. All runs executed in the project are detailed in the Run Logs table. Click an entry in the table to
view the specified run in the project Runs UI.


13.1.8 Control center user page

This page shows detailed data on a user’s activity in Domino. The top of the page has charts showing the types of runs
this user starts, which projects the user works in, and which hardware tiers the user uses. You can click on bars in the
project and hardware tier charts to view the object represented. All runs started by this user are detailed in the Run
Logs table. Click an entry in the table to view the specified run in the project Runs UI.


13.2 Exporting Control Center data with the API

The Control Center interface in Domino provides many different views on deployment usage, broken down by hard-
ware tier, project, or user. However, if you want to do a more detailed, custom analysis, it’s possible for Domino
administrators to use the API to export Control Center data for examination with Domino’s data science features or
external business intelligence applications.
The endpoint that serves this data is /v4/gateway/runs/getByBatchId.
Click through to read the REST documentation on this endpoint, or see below for a detailed description plus examples.

13.2.1 API keys

To make an API call, you’ll need the API key for your account. In this case, accessing the full deployment’s Control
Center data requires that you use an admin account. Once you’re logged in as an admin, click your username at bottom
left, then click Account Settings.

Click API Key from the settings menu to link down to the API Key panel. Copy the displayed key and keep it handy.
You’ll need it to make requests to the API.

Note that anyone bearing this key could authenticate to the Domino API as you. Treat it like a sensitive password.

13.2.2 Using the data gateway endpoint

Here’s a basic call to the data export endpoint, executed with cURL:

curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId'

By default, the endpoint starts with the oldest available run data, beginning from January 1st, 2018. Older data is not
available. The command also has a default limit of 1000 runs worth of data. As written, the call above will return data
on the oldest 1000 runs available.


To try out this example, fill in <your-api-key> and <your-domino-url> in the command above.
The standard JSON response object you receive will have the following schema:

{
"runs": [
{
"batchId": "string",
"runId": "string",
"title": "string",
"command": "string",
"status": "string",
"runType": "string",
"userName": "string",
"userId": "string",
"projectOwnerName": "string",
"projectOwnerId": "string",
"projectName": "string",
"projectId": "string",
"runDurationSec": 0,
"hardwareTier": "string",
"hardwareTierCostCurrency": "string",
"hardwareTierCostAmount": 0,
"queuedTime": 0,
"startTime": 0,
"endTime": 0,
"totalCostCurrency": "string",
"totalCostAmount": 0
}
],
"nextBatchId": "string"
}

Each run recorded by the Control Center gets a batchId, which is an incrementing field that can be used as a
cursor to fetch data in multiple batches. You can see in the response above, after the array of runs objects there is a
nextBatchId parameter that points to the next run that would have been included.
You can use that ID as a query parameter in a subsequent request to get the next batch:

curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId?batchId=<your-batchId-here>'

You can also request the data as CSV by including a header with Accept: text/csv. On the Unix shell, you
can write the response to a file with the > operator. This is a quick way to get data suitable for import into analysis
tools:

curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
-H 'Accept: text/csv' \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId' > your_file.csv

13.2.3 Example: Getting all data

The code below shows a simple Python script that fetches all Control Center data from the earliest available to a
configurable end date, and writes it to a CSV file. Fill in the date of the last known completed run to fetch all available
historical data.


import requests
import json
import pandas as pd
import os
from datetime import datetime
from datetime import timedelta

URL = "https://<your-domino-url>/v4/gateway/runs/getByBatchId"
headers = {'X-Domino-Api-Key': '<your-api-key>'}
last_date = 'YYYY-MM-DD'  # date of the last known completed run

# Extend the cutoff by one day so runs that ended on last_date are included.
last_date = datetime.strftime(datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1), '%Y-%m-%d')

# Start from a clean output file.
try:
    os.remove('output.csv')
except OSError:
    pass

batch_ID_param = ""
while True:
    # Fetch the next batch of up to 1000 runs.
    batch = requests.get(url=URL + batch_ID_param, headers=headers)
    parsed = json.loads(batch.text)
    batch_ID_param = "?batchId=" + parsed['nextBatchId']
    df = pd.DataFrame(parsed['runs'])
    # Append runs up to the cutoff; write the CSV header only on the first batch.
    # This assumes endTime is returned in a form comparable to the cutoff string.
    df[df.endTime <= last_date].to_csv('output.csv', mode="a+", index=False,
                                       header=not os.path.exists('output.csv'))
    # Stop on a partial batch, or once runs past the cutoff start appearing.
    if len(df.index) < 1000 or len(df.index) > len(df[df.endTime <= last_date].index):
        break

Running a script like this periodically allows you to easily import fresh data into your tools for custom analysis. You
can work with the data in a Domino project, or make it available to third-party tools like Tableau.

