Release 4.4.0
1 About Domino 4
2 Architecture
2.1 Overview
2.2 Services
2.3 Software
2.4 User accounts
2.5 Service mesh
3 Kubernetes
3.1 Cluster requirements
3.2 Requirements checker
3.3 Domino on EKS
3.4 Domino on GKE
3.5 Domino on AKS
3.6 Domino on OpenShift
3.7 NVIDIA DGX in Domino
3.8 Domino in Multi-Tenant Kubernetes Cluster
3.9 Encryption in transit
3.10 Compatibility
4 Installation
4.1 Installation process
4.2 Configuration Reference
4.3 Installer configuration examples
4.4 Private or offline installation
4.5 fleetcommand-agent release notes
5 Configuration
5.1 Central Configuration
5.2 Change the default project for new users
5.3 Project stage configuration
5.4 Domino integration with Atlassian Jira
6 Compute
6.1 Managing the Compute Grid
6.2 Hardware Tier best practices
6.3 Model resource quotas
6.4 Persistent volume management
6.5 Adding a node pool to your Domino cluster
6.6 Removing a node from service
8 Operations
8.1 Domino application logging
8.2 Domino monitoring
8.3 Sizing infrastructure for Domino
11 Environments
11.1 Environment management best practices
11.2 Caching environment images in EKS
Domino Admin Docs Documentation, Release 4.4.0
This guide describes how to install, operate, administer, and configure the Domino application in your own Kubernetes
cluster. This content is applicable to Domino users with self-installation licenses.
If you are interested in running Domino as a managed service in your cloud or in a single-tenant vendor cloud, contact
Domino. Managed service customers will have installation, operations, and administration handled via professional
services, and the content of this guide will not be required or applicable.
CHAPTER 1
About Domino 4
Domino is a data science platform that enables fast, reproducible, and collaborative work on data products like models,
dashboards, and data pipelines. Users can run regular jobs, launch interactive notebook sessions, view vital metrics,
share work with collaborators, and communicate with their colleagues in the Domino web application.
All Domino components run in Kubernetes. You can run an instance of Domino in the cloud or on-premises in your
office or data center.
Use the links in the sidebar to learn more about the Domino architecture and supported Kubernetes cluster configurations. If you need help setting up a Domino-compatible Kubernetes cluster, send an email to sales@dominodatalab.com.
CHAPTER 2
Architecture
2.1 Overview
2.2 Services
Domino services are best understood when arranged into logical layers based on function and communication. A
description of the functionality provided by each layer follows.
The client layer contains the Frontend pods that are the targets of a network load balancer. Domino users can access
Domino’s core features by connecting to the Frontends via:
• Web browser, in which case the Frontend serves the Domino application
• HTTPS request to the Domino API, which the Frontend routes to the API server
• Domino CLI, which uses the API
The Frontends run on platform nodes.
The service layer contains the Domino API server, Dispatcher, Keycloak authentication service, and the metadata
services that Domino uses to provide reproducibility and collaboration features. MongoDB stores application object
metadata, Git manages code and file versioning, Elasticsearch powers in-app search, and the Docker registry is used
by Domino Environments. Project data, logs, and backups are written to durable blob storage.
All of these services run on platform nodes.
The service layer also contains the dedicated master nodes for the Kubernetes cluster.
The execution layer is where Domino will launch and manage ephemeral pods that run user workloads. These may
host Jobs, Model APIs, Apps, Workspaces, and Docker image builds.
These run on compute nodes.
2.3 Software
The following primary application services run on platform nodes in the Domino Kubernetes cluster.
• nginx
nginx is an open source HTTP server and reverse proxy. Domino uses nginx to serve the Domino web application and as a reverse proxy to route requests to internal services.
Learn more about nginx
• Domino API server
The Domino API server exposes the Domino API and handles REST API requests from the web application and user clients.
• Domino dispatcher
The Domino dispatcher handles orchestration of workloads on compute nodes. The dispatcher launches new
compute pods, connects results telemetry back to the Domino application, and monitors the health of running
workloads.
• Keycloak
Keycloak is an enterprise-grade open source authentication service. Domino uses Keycloak to store user identities and properties, and optionally for identity brokering or identity federation to SSO systems and identity providers.
Keycloak supports the following protocols:
– SAML v2.0
– OpenID Connect v1.0
– OAuth v2.0
– LDAP(S)
Learn more about Keycloak
• MongoDB
MongoDB is an open source document database. Domino uses MongoDB to store Domino entities, like projects,
users, and organizations. Domino stores the structure of these entities in MongoDB, but underlying data is stored
separately in encrypted blob storage.
Learn more about MongoDB
• Git
Git is a free and open source distributed version control system. Domino uses Git internally for revisioning
projects and files. Domino Executors also run Git clients, and they can interact with user-controlled external
repositories to access code or data.
Learn more about Git
• Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine. Domino uses Elasticsearch to power user
searches for Domino objects like projects, files, and models. Domino also uses Elasticsearch for logging.
Learn more about Elasticsearch
• Docker registry
The Docker registry is an application used to store and distribute Docker images. Domino uses its registry
to store images for Domino environments and Model APIs. These images are built to user specifications by
compute nodes.
Learn more about Docker registry
• Fluentd
Fluentd is an open source application that unifies and processes logging and telemetry data. Domino uses
Fluentd to aggregate logs and forward data to durable storage.
Learn more about Fluentd
• Redis
Redis is an open source data structure cache. Domino uses Redis to cache logs in-memory for streaming back
to users through the web application.
Learn more about Redis
• RabbitMQ
RabbitMQ is an open source message broker. Domino uses RabbitMQ as an event bus to asynchronously
distribute event messages between Domino services.
Learn more about RabbitMQ
• Postgres
Postgres is an open source relational database system. Domino uses Postgres as a storage system for Keycloak
data on user identities and attributes.
Learn more about Postgres
2.4 User accounts
Domino uses Keycloak to manage user accounts. Keycloak supports the following modes of authentication to Domino.
When using local accounts, anyone with network access to the Domino application may create a Domino account.
Users supply a username, password, and email address on the signup page to create a Domino-managed account.
Domino administrators can track, manage, and deactivate these accounts through the application. Domino can be
configured with multi-factor authentication and password requirements through Keycloak.
Learn more about Keycloak administration
Keycloak can be configured to integrate with an Active Directory (AD) or LDAP(S) identity provider (IdP). When
identity federation is enabled, local account creation is disabled and Keycloak will authenticate users against identities
in the external IdP and retrieve configurable properties about those users for Domino usernames and email addresses.
Learn more about Keycloak identity federation
Keycloak can be configured to broker authentication between Domino and an external authentication or SSO system.
When identity brokering is enabled, Domino will redirect users in the authentication flow to a SAML, OAuth, or OIDC
service for authentication. Following authentication in the external service, the user is routed back to Domino with a
token containing user properties.
Learn more about Keycloak identity brokering
2.5 Service mesh
A service mesh provides a transparent, language-independent way to automate application network functions such as traffic routing, load balancing, observability, and encryption. Domino can optionally deploy or integrate with Istio, an open source service mesh (version 1.7.2 or later). Istio is required to implement intra-cluster encryption in transit.
Learn more about Istio
CHAPTER 3
Kubernetes
Domino 4 runs in your Kubernetes cluster, and the infrastructure can be managed with Kubernetes native tools like
kubectl.
You can deploy Domino 4 into a Kubernetes cluster that meets the following requirements.
• Kubernetes 1.13+
• Cluster permissions
Domino needs permission to install and configure pods in the cluster via Helm. The Domino installer is delivered
as a containerized Python utility that operates Helm through a kubeconfig that provides service account access
to the cluster.
• Three namespaces
Domino creates three dedicated namespaces, one for Platform nodes, one for Compute nodes, and one for
installer metadata and secrets.
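Before installing, one can sanity-check that the kubeconfig supplied to the installer has sufficient access. The commands below are a sketch using standard kubectl RBAC introspection; the kubeconfig path is a placeholder for your own file:

```shell
# Check whether the kubeconfig's identity can create namespaces and act cluster-wide
kubectl --kubeconfig ./domino-kubeconfig auth can-i create namespaces
kubectl --kubeconfig ./domino-kubeconfig auth can-i '*' '*' --all-namespaces
```

Both commands should print "yes" for a service account with sufficient cluster access.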
Storage classes
In GCP, Compute Engine persistent disks are used to back this storage class. Consult the following example configuration for a compatible GCEPD storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dominodisk
parameters:
  replication-type: none
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
Native
For shared storage, Domino uses (and in some cases requires) the cloud provider's native object store for the following resources and services:
• Blob storage. On AWS, blob storage must be backed by S3 (see Blob storage). On other infrastructure, the dominoshared storage class is used.
• Logs. On AWS, log storage must be backed by S3 (see Blob storage). On other infrastructure, the dominoshared storage class is used.
• Backups. On all supported cloud providers, backups are stored in the native blob store. On-premises, backups are backed by the dominoshared storage class.
– AWS: S3
– Azure: Azure Blob Storage
– GCP: GCP Cloud Storage
• Datasets. On AWS, Datasets storage must be backed by EFS (see Datasets storage). On other infrastructure, the dominoshared storage class is used.
On-Prem
In on-prem environments, both dominodisk and dominoshared can be backed by NFS. In some cases host volumes can be used, and for Git, Postgres, and MongoDB they are even preferred, since Postgres and MongoDB provide their own state replication. Host volumes can also be used for Runs, but this is not preferred, because Domino benefits from files cached on block storage volumes that can move between nodes. If host volumes are used for Runs, file caching should be disabled, and executions in large Projects may be slow to start.
Summary
Domino requires a minimum of two node pools, one to host the Domino Platform and one to host Compute workloads.
Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.
1. Platform pool requirements
• Boot Disk: 128GB
• Min Nodes: 3
• Max Nodes: 3
• Spec: 8 CPU / 32GB
• Labels: dominodatalab.com/node-pool: platform
• Tags:
– kubernetes.io/cluster/{{ cluster_name }}: owned
– k8s.io/cluster-autoscaler/enabled: true # Optional for autodiscovery
– k8s.io/cluster-autoscaler/{{ cluster_name }}: owned # Optional for autodiscovery
Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. Network policies are implemented by the network plugin, so your cluster must use a networking solution that supports NetworkPolicy, such as Calico.
Domino will need to be configured to serve from a specific FQDN, and DNS for that name should resolve to the address of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on ports 80 and 443 to port 80 on all nodes in the Platform pool.
Health checks for this load balancer should use HTTP on port 80 and check for 200 responses from a path of /health
on the nodes.
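A health probe equivalent can be simulated with curl from any host with network access to a platform node; the node address below is a placeholder:

```shell
# Expect an HTTP 200 status code from the /health path on port 80
curl -fsS -o /dev/null -w "%{http_code}\n" http://PLATFORM_NODE_IP/health
```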
The Domino Cluster Requirements Checker is a command-line utility that checks whether a Kubernetes cluster conforms to Domino requirements. The Cluster Requirements Checker is a plugin for Sonobuoy, a Kubernetes diagnostic tool. The instructions on this page run only the Domino plugin, not the full Kubernetes conformance suite.
The Cloud Native Computing Foundation has certified many Kubernetes offerings. Kubernetes certification steps include conformance tests run by Sonobuoy. Domino uses the Sonobuoy Plugin Framework to perform customized Domino conformance checks on a cluster prior to installing Domino.
3.2.1 Instructions
You should perform the following steps from a workstation with kubectl admin access to the target cluster.
1. Install the Sonobuoy binary that matches the Kubernetes version of your cluster. Run the following command to determine your cluster's Kubernetes version:
kubectl version
2. Set a KUBECONFIG environment variable to a path to a kubeconfig file with admin access to the target cluster.
export KUBECONFIG=~/.kube/config
3. Create a domino-checker.yaml configuration file with the following contents. This file is also available for download from GitHub.
sonobuoy-config:
  driver: DaemonSet
  plugin-name: domino
  result-format: junit
  skip-cleanup: true
spec:
  env:
    - name: DOCKER_API_VERSION
      value: '1.38'
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: RESULTS_DIR
      value: /tmp/results
  image: quay.io/domino/k8s-validator:latest
  imagePullPolicy: Always
  name: domino
  securityContext:
    privileged: false
  volumeMounts:
    - mountPath: /tmp/results
      name: results
      readOnly: false
    - mountPath: /var/run/docker.sock
      name: docker-mnt
      readOnly: false
extra-volumes:
  - name: docker-mnt
    hostPath:
      path: /var/run/docker.sock
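The configuration above can then be run with the standard Sonobuoy CLI. The following sequence is a sketch; the plugin file name matches the file created in step 3:

```shell
# Run only the Domino checker plugin and wait for completion
sonobuoy run --plugin domino-checker.yaml --wait

# Retrieve the results tarball; the path is stored for later inspection
resultsfile=$(sonobuoy retrieve)

# Summarize the Domino plugin's pass/fail results
sonobuoy results "$resultsfile" --plugin domino
```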
A failing run produces a summary of the failed checks, for example:

Failed tests:
Node CPU
Node Memory

To clean up the checker's resources when finished, run sonobuoy delete:

validator> sonobuoy delete --wait
INFO[0000] deleted kind=namespace namespace=sonobuoy
Run the following command to get more information about failed checks.
sonobuoy results $resultsfile --plugin domino --mode=dump
The output will look like this.
name: domino
status: failed
Items:
- name: gke-etienne-gke-1-build-13b06f55-8f2l
  status: failed
  Items:
  - name: domino-junit.xml
    status: failed
    Meta:
      file: results/gke-etienne-gke-1-build-13b06f55-8f2l/domino-junit.xml
    Items:
    - name: Domino Sonobuoy K8s Conformance Plugin
      status: failed
      Items:
      - name: RWX Storage Class Available
        status: passed
      - name: Default Storage Class Set
        status: passed
      - name: Helm (Tiller) Service does not exist
        status: passed
      - name: Node Labels
        status: passed
      - name: Node CPU
        status: failed
        Details:
          failure: Insufficient 24 required but only 8 of 24 available for Domino
      - name: Node Memory
        status: failed
        Details:
          failure: Insufficient 96Gi required but only 30880736Ki of 92642208Ki available for Domino
      - name: 'Docker Daemon Available: 4.14.145+'
        status: passed
- name: gke-etienne-gke-1-compute-a5dfc474-g5s4
  status: passed
  Items:
  - name: domino-junit.xml
    status: passed
    Meta:
      file: results/gke-etienne-gke-1-compute-a5dfc474-g5s4/domino-junit.xml
    Items:
    - name: Domino Sonobuoy K8s Conformance Plugin
      status: passed
- name: gke-etienne-gke-1-platform-a70f6fe2-fcss
  status: passed
  Items:
  - name: domino-junit.xml
    status: passed
    Meta:
      file: results/gke-etienne-gke-1-platform-a70f6fe2-fcss/domino-junit.xml
Domino 4 can run on a Kubernetes cluster provided by AWS Elastic Kubernetes Service. When running on EKS, the
Domino 4 architecture uses AWS resources to fulfill the Domino cluster requirements as follows:
• Kubernetes control moves to the EKS control plane with managed Kubernetes masters
• Domino uses a dedicated Auto Scaling Group (ASG) of EKS workers to host the Domino platform
• ASGs of EKS workers host elastic compute for Domino executions
• AWS S3 is used to store user data, internal Docker registry, backups, and logs
• AWS EFS is used to store Domino Datasets
• The kubernetes.io/aws-ebs provisioner is used to create persistent volumes for Domino executions
• Calico is used as a network plugin to support Kubernetes network policies
• Domino cannot be installed on EKS Fargate, since Fargate does not support stateful workloads with persistent
volumes.
• Instead of EKS Managed Node Groups, Domino recommends creating custom node groups to allow for additional control and customized Amazon Machine Images. Domino recommends eksctl, Terraform, or CloudFormation for setting up custom node groups.
All nodes in such a deployment have private IPs, and internode traffic is routed by an internal load balancer. Nodes in the cluster can optionally have egress to the Internet through a NAT gateway.
This section describes how to configure an Amazon EKS cluster for use with Domino.
VPC networking
If you plan to do VPC peering or set up a site-to-site VPN connection to connect your cluster to other resources like
data sources or authentication services, be sure to configure your cluster VPC accordingly to avoid any address space
collisions.
Namespaces
No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:
Namespace Contains
platform Durable Domino application, metadata, platform services required for platform operation
compute Ephemeral Domino execution pods launched by user actions in the application
domino-system Domino installation metadata and secrets
Node pools
The EKS cluster must have at least two ASGs that produce worker nodes with the following specifications and distinct
node labels, and it may include an optional GPU pool:
The platform ASG can run in 1 availability zone or across 3 availability zones. If you want Domino to run with
some components deployed as highly available ReplicaSets you must use 3 availability zones. Using 2 zones is not
supported, as it results in an even number of nodes in a single failure domain. Note that all compute node pools you
use should have corresponding ASGs in any AZ used by other node pools. Setting up an isolated node pool in one
zone can cause volume affinity issues.
To run the default and default-gpu pools across multiple availability zones, you will need duplicate ASGs in
each zone with the same configuration, including the same labels, to ensure pods are delivered to the zone where the
required ephemeral volumes are available.
The easiest way to get suitable drivers onto GPU nodes is to use the EKS-optimized AMI distributed by Amazon as
the machine image for the GPU node pool.
Additional ASGs can be added with distinct dominodatalab.com/node-pool labels to make other instance
types available for Domino executions. Read Managing the Domino compute grid to learn how these different node
types are referenced by label from the Domino application.
Network plugin
Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. Network policies are implemented by the network plugin, so your cluster must use a networking solution that supports NetworkPolicy, such as Calico.
Refer to the AWS documentation on installing Calico for your EKS cluster.
If you use the Amazon VPC CNI for networking, with Calico used only for NetworkPolicy enforcement, ensure the subnets you use for your cluster have CIDR ranges of sufficient size, because every deployed pod in the cluster is assigned an elastic network interface and consumes a subnet address. Domino recommends at least a /23 CIDR for the cluster.
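As a quick check on that recommendation, the address count of a CIDR block follows directly from its prefix length:

```shell
# Addresses in a /23 block: 2^(32-23) = 2^9
echo $((1 << (32 - 23)))   # prints 512
```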
Docker bridge
By default, AWS AMIs do not have bridge networking enabled for Docker containers. Domino requires bridge networking for environment builds. Add --enable-docker-bridge true to the user data of the launch configuration used by your cluster's worker node groups.

Storage classes
The EKS cluster must be equipped with an EBS-backed storage class that Domino will use to provision ephemeral
volumes for user execution. Consult the following storage class specification as an example.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: domino-compute-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
Datasets storage
To store Datasets in Domino, an EFS (Elastic File System) file system must be configured. The file system must be provisioned, with an access point configured to allow access from the EKS cluster.
Configure the access point with the following key parameters.
• Root directory path: /domino
• User ID: 0
• Group ID: 0
• Owner user ID: 0
• Owner group ID: 0
• Root permissions: 777
Record the file system and access point IDs for use when installing Domino.
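As a sketch, an access point with these parameters could be created with the AWS CLI; the file system ID below is a placeholder for your own:

```shell
# Create an EFS access point rooted at /domino, owned by root (UID/GID 0) with 777 permissions
aws efs create-access-point \
  --file-system-id fs-0123456789abcdef0 \
  --posix-user "Uid=0,Gid=0" \
  --root-directory "Path=/domino,CreationInfo={OwnerUid=0,OwnerGid=0,Permissions=777}"
```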
Blob storage
When running in EKS, Domino can use Amazon S3 for durable object storage.
Create the following four S3 buckets:
• 1 bucket for user data
• 1 bucket for internal Docker registry
• 1 bucket for logs
• 1 bucket for backups
Configure each bucket to permit read and write access from the EKS cluster. This involves applying an IAM policy to
the nodes in the cluster like the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::$your-logs-bucket-name",
        "arn:aws:s3:::$your-backups-bucket-name",
        "arn:aws:s3:::$your-user-data-bucket-name",
        "arn:aws:s3:::$your-registry-bucket-name"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::$your-logs-bucket-name/*",
        "arn:aws:s3:::$your-backups-bucket-name/*",
        "arn:aws:s3:::$your-user-data-bucket-name/*",
        "arn:aws:s3:::$your-registry-bucket-name/*"
      ]
    }
  ]
}
Record the names of these buckets for use when installing Domino.
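The buckets themselves can be created with the AWS CLI. The names below are placeholders, and S3 bucket names must be globally unique:

```shell
# Create one bucket each for user data, the registry, logs, and backups
for bucket in my-domino-user-data my-domino-registry my-domino-logs my-domino-backups; do
  aws s3 mb "s3://$bucket" --region us-west-2
done
```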
Autoscaling access
If you intend to deploy the Kubernetes Cluster Autoscaler in your cluster, the instance profile used by your platform
nodes must have the necessary AWS Auto Scaling permissions.
See the following example policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceTypes"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
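One way to grant these permissions is an inline policy on the IAM role behind the platform nodes' instance profile; the role name, policy name, and file name here are examples:

```shell
# Attach the autoscaling policy (saved locally as JSON) inline to the platform node role
aws iam put-role-policy \
  --role-name domino-platform-node-role \
  --policy-name cluster-autoscaler-access \
  --policy-document file://autoscaler-policy.json
```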
Domain
Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino.
If you’ve applied the configurations described above to your EKS cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.
See below for a sample YAML configuration file you can use with eksctl, the official EKS command line tool, to create
a Domino-compatible cluster.
Note that after creating a cluster with this configuration, you must still create the EFS and S3 storage systems and
configure them for access from the cluster as described above.
# $LOCAL_DIR/cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: domino-test-cluster
  region: us-west-2
nodeGroups:
  - name: domino-platform
    instanceType: m5.2xlarge
    minSize: 3
    maxSize: 3
    desiredCapacity: 3
    volumeSize: 128
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "platform"
    tags:
      "k8s.io/cluster-autoscaler/enabled": "true" # Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" # Optional for autodiscovery; insert your cluster_name
  - name: domino-default
    instanceType: m5.2xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    volumeSize: 400
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "default"
      "domino/build-node": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default"
      "k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
      "k8s.io/cluster-autoscaler/enabled": "true" # Optional for autodiscovery
      "k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" # Optional for autodiscovery; insert your cluster_name
    preBootstrapCommands:
      - "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
      - "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_script"
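Assuming the configuration above is saved as cluster.yaml, the cluster can then be created with eksctl:

```shell
# Create the EKS cluster from the config file; provisioning takes some time
eksctl create cluster -f cluster.yaml
```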
See below for a sample YAML configuration file you can use with eksctl, the official EKS command line tool, to create
a Domino-compatible cluster spanning multiple availability zones. Note that in order to avoid issues with execution
volume affinity, you must create duplicate groups in each AZ.
# $LOCAL_DIR/cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: domino-test-cluster
region: us-west-2
nodeGroups:
- name: domino-platform-a
instanceType: m5.2xlarge
minSize: 1
maxSize: 3
desiredCapacity: 1
volumeSize: 128
availabilityZones: ["us-west-2a"]
labels:
"dominodatalab.com/node-pool": "platform"
tags:
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
(continues on next page)
28 Chapter 3. Kubernetes
Domino Admin Docs Documentation, Release 4.4.0
nodeGroups:
- name: domino-platform-b
instanceType: m5.2xlarge
minSize: 1
maxSize: 3
desiredCapacity: 1
volumeSize: 128
availabilityZones: ["us-west-2b"]
labels:
"dominodatalab.com/node-pool": "platform"
tags:
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
"k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for
˓→autodiscovery <insert your cluster_name>
nodeGroups:
- name: domino-platform-c
instanceType: m5.2xlarge
minSize: 1
maxSize: 3
desiredCapacity: 1
volumeSize: 128
availabilityZones: ["us-west-2c"]
labels:
"dominodatalab.com/node-pool": "platform"
tags:
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
"k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for
˓→autodiscovery <insert your cluster_name>
- name: domino-default-a
instanceType: m5.2xlarge
minSize: 0
maxSize: 3
volumeSize: 400
availabilityZones: ["us-west-2a"]
labels:
"dominodatalab.com/node-pool": "default"
"domino/build-node": "true"
tags:
"k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool":
˓→"default"
"k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
"k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for
˓→autodiscovery <insert your cluster_name>
preBootstrapCommands:
- "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
- "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_
˓→script"
"k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
"k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for
˓→autodiscovery <insert your cluster_name>
preBootstrapCommands:
- "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
- "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_
˓→script"
"k8s.io/cluster-autoscaler/node-template/label/domino/build-node": "true"
"k8s.io/cluster-autoscaler/enabled": "true" #Optional for autodiscovery
"k8s.io/cluster-autoscaler/{{ cluster_name }}": "owned" #Optional for
˓→autodiscovery <insert your cluster_name>
preBootstrapCommands:
- "cp /etc/docker/daemon.json /etc/docker/daemon_backup.json"
- "echo -e '.bridge=\"docker0\" | .\"live-restore\"=false' > /etc/docker/jq_
˓→script"
  - name: domino-gpu-b
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 2
    volumeSize: 400
    availabilityZones: ["us-west-2b"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
  - name: domino-gpu-c
    instanceType: p2.8xlarge
    minSize: 0
    maxSize: 2
    volumeSize: 400
    availabilityZones: ["us-west-2c"]
    ami: ami-0ad9a8dc09680cfc2
    labels:
      "dominodatalab.com/node-pool": "default-gpu"
      "nvidia.com/gpu": "true"
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "default-gpu"
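The preBootstrapCommands in the default node groups above stage a jq filter that restores the docker0 bridge and disables live-restore in /etc/docker/daemon.json (the commands that actually apply the filter are truncated in this excerpt). As a sketch of the intended transformation, assuming the filter is applied to the backed-up daemon.json:

```python
import json

def patch_daemon_config(raw: str) -> str:
    """Sketch of the jq filter '.bridge="docker0" | ."live-restore"=false'
    applied to a Docker daemon.json document (file paths omitted for illustration)."""
    cfg = json.loads(raw)
    cfg["bridge"] = "docker0"    # re-enable the docker0 bridge, needed for image builds
    cfg["live-restore"] = False  # live-restore must be off when the bridge changes
    return json.dumps(cfg, indent=2)

print(patch_daemon_config('{"log-driver": "json-file"}'))
```

Any settings already present in daemon.json are preserved; only the two keys above are overridden.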
Domino 4 can run on a Kubernetes cluster provided by the Google Kubernetes Engine (GKE).
3.4.1 Overview
When running on GKE, the Domino 4 architecture uses GCP resources to fulfill the Domino cluster requirements as
follows:
• Kubernetes control is managed by the GKE cluster
• Domino uses one node pool of three n1-standard-8 worker nodes to host the Domino platform
• Additional node pools host elastic compute for Domino executions with optional GPU accelerators
• Cloud Filestore is used to store user data, backups, logs, and Domino Datasets
• A Cloud Storage Bucket is used to store the Domino Docker Registry.
• The kubernetes.io/gce-pd provisioner is used to create persistent volumes for Domino executions.
This section describes how to configure a GKE cluster for use with Domino.
Namespaces
No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:
Namespace Contains
platform Durable Domino application, metadata, platform services required for platform operation
compute Ephemeral Domino execution pods launched by user actions in the application
domino-system Domino installation metadata and secrets
Node pools
The GKE cluster must have at least two node pools that produce worker nodes with the following specifications and
distinct node labels, and it may include an optional GPU pool:
If you want to configure the default-gpu pool, you must add a GPU accelerator to the node pool. Read the GKE documentation on available accelerators and on deploying a DaemonSet that automatically installs the necessary drivers.
Additional node pools can be added with distinct dominodatalab.com/node-pool labels to make other instance types available for Domino executions. Read Managing the Domino compute grid to learn how these different node types are referenced by label from the Domino application.
Consult the Terraform snippets below for code representations of the required node pools.
Platform pool
resource "google_container_node_pool" "platform" { # illustrative wrapper; supply your own cluster and name arguments
  initial_node_count = 3
  autoscaling {
    max_node_count = 3
    min_node_count = 3
  }
  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"
    labels = {
      "dominodatalab.com/node-pool" = "platform"
    }
    disk_size_gb    = 128
    local_ssd_count = 1
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  timeouts {
    delete = "20m"
  }
}
Compute pool

resource "google_container_node_pool" "compute" { # illustrative wrapper; supply your own cluster and name arguments
  initial_node_count = 1
  autoscaling {
    max_node_count = 20
    min_node_count = 1
  }
  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"
    labels = {
      "domino/build-node"            = "true"
      "dominodatalab.com/build-node" = "true"
      "dominodatalab.com/node-pool"  = "default"
    }
    disk_size_gb    = 400
    local_ssd_count = 1
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  timeouts {
    delete = "20m"
  }
}
GPU pool

resource "google_container_node_pool" "gpu" { # illustrative wrapper; supply your own cluster and name arguments
  initial_node_count = 0
  autoscaling {
    max_node_count = 5
    min_node_count = 0
  }
  node_config {
    preemptible  = false
    machine_type = "n1-standard-8"
    guest_accelerator {
      type  = "nvidia-tesla-p100"
      count = 1
    }
    labels = {
      "dominodatalab.com/node-pool" = "default-gpu"
    }
    disk_size_gb    = 400
    local_ssd_count = 1
    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  timeouts {
    delete = "20m"
  }
}
Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. By
default, the network plugin in GKE will not enforce these policies. To run Domino securely on GKE, you must enable
enforcement of network policies.
Read the GKE documentation for instructions on enabling network policy enforcement for your cluster.
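As a reference sketch, on an existing cluster this is typically a two-step gcloud operation (the cluster name is a placeholder; each step triggers a rolling recreation of the node pools, so consult the GKE documentation for current flags and impact):

```
gcloud container clusters update <cluster-name> --update-addons=NetworkPolicy=ENABLED
gcloud container clusters update <cluster-name> --enable-network-policy
```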
The Domino installer will automatically create a storage class like the example below for provisioning GCE persistent disks as execution volumes. No manual setup is necessary for this storage class.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dominodisk
parameters:
  replication-type: none
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
Shared storage
A Cloud Filestore instance must be provisioned with at least 10 TB of capacity, and it must be configured to allow access from the cluster. You will provide the IP address and mount path of this instance to the Domino installer, and it will create an NFS storage class like the one below.
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    app.kubernetes.io/instance: nfs-client-provisioner
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: nfs-client-provisioner
    helm.sh/chart: nfs-client-provisioner-1.2.6-0.1.4
  name: domino-shared
parameters:
  archiveOnDelete: "false"
provisioner: cluster.local/nfs-client-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
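Once installed, workloads can request shared storage through this class with an ordinary PersistentVolumeClaim; a minimal sketch (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-shared-claim
spec:
  accessModes:
    - ReadWriteMany        # NFS-backed class supports shared read-write access
  storageClassName: domino-shared
  resources:
    requests:
      storage: 10Gi
```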
You will need one Cloud Storage Bucket accessible from your cluster to be used for storing the internal Domino
Docker Registry.
Domain
Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino. Once
Domino is deployed into your cluster, you must set up DNS for this name to point to an HTTPS Cloud Load Balancer
that has an SSL certificate for the chosen name, and forwards traffic to port 80 on your platform nodes.
If you’ve applied the configurations described above to your GKE cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.
Domino 4 can run on a Kubernetes cluster provided by the Azure Kubernetes Service. When running on AKS, the
Domino 4 architecture uses Azure resources to fulfill the Domino cluster requirements as follows:
• For a complete Terraform module for Domino-compatible AKS provisioning, see terraform-azure-aks on
GitHub.
• Kubernetes control is handled by the AKS control plane with managed Kubernetes masters
• The AKS cluster’s default node pool is configured to host the Domino platform
• Additional AKS node pools provide compute nodes for user workloads
• An Azure storage account stores Domino blob data and datasets
• The kubernetes.io/azure-disk provisioner is used to create persistent volumes for Domino executions
• The Advanced Azure CNI is used for cluster networking, with network policy enforcement handled by Calico
• Ingress to the Domino application is handled by an SSL-terminating Application Gateway that points to a Kubernetes load balancer
• Domino recommends provisioning with Terraform for extended control and customizability of all resources. When setting up your Azure Terraform provider, please add a partner_id with a value of 31912fbf-f6dd-5176-bffb-0a01e8ac71f2 to enable usage attribution.
This section describes how to configure an AKS cluster for use with Domino.
Resource groups
You can provision the cluster, storage, and application gateway in an existing resource group. Note that when creating the cluster, Azure will create a separate resource group to contain the cluster components themselves.
Namespaces
No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:
Namespace Contains
platform Durable Domino application, metadata, platform services required for platform operation
compute Ephemeral Domino execution pods launched by user actions in the application
domino-system Domino installation metadata and secrets
Node pools
The AKS cluster must have at least two node pools that produce worker nodes with the following specifications and distinct node labels, and it may include an optional GPU pool:
The recommended architecture configures the cluster’s initial default node pool with the correct label and size to serve
as the platform node pool. See the below cluster Terraform resource for a complete example.
resource "azurerm_kubernetes_cluster" "aks" { # illustrative wrapper; the resource name is an assumption
  name                       = "example_cluster"
  enable_pod_security_policy = false
  location                   = "East US"
  resource_group_name        = "example_resource_group"
  dns_prefix                 = "example_cluster"
  private_cluster_enabled    = false

  default_node_pool {
    enable_node_public_ip = false
    name                  = "platform"
    node_count            = 4
    node_labels           = { "dominodatalab.com/node-pool" : "platform" }
    vm_size               = "Standard_DS5_v2"
    availability_zones    = ["1", "2", "3"]
    max_pods              = 250
    os_disk_size_gb       = 128
    node_taints           = []
    enable_auto_scaling   = true
    min_count             = 1
    max_count             = 4
  }

  network_profile {
    load_balancer_sku  = "Standard"
    network_plugin     = "azure"
    network_policy     = "calico"
    dns_service_ip     = "100.97.0.10"
    docker_bridge_cidr = "172.17.0.1/16"
    service_cidr       = "100.97.0.0/16"
  }
}
A separate node pool for Domino default compute should be added after the cluster is created. Note that this is not
the initial cluster default node pool, but a separate node pool named default that is added to serve default Domino
compute. See the below node pool Terraform resource for a complete example.
resource "azurerm_kubernetes_cluster_node_pool" "default" { # illustrative wrapper; the resource name is an assumption
  enable_node_public_ip = false
  kubernetes_cluster_id = "example_cluster_id"
  name                  = "default"
  node_count            = 1
  vm_size               = "Standard_DS4_v2"
  availability_zones    = ["1", "2", "3"]
  max_pods              = 250
  os_disk_size_gb       = 128
  os_type               = "Linux"
  node_labels = {
    "domino/build-node"            = "true"
    "dominodatalab.com/build-node" = "true"
    "dominodatalab.com/node-pool"  = "default"
  }
  node_taints         = []
  enable_auto_scaling = true
  min_count           = 1
  max_count           = 20
}
Additional node pools can be added with distinct dominodatalab.com/node-pool labels to make other instance types available for Domino executions. Read Managing the Domino compute grid to learn how these different node types are referenced by label from the Domino application. When adding GPU node pools, keep in mind the Azure guidance and best practices on using GPU nodes in AKS.
Network plugin
The Domino-hosting cluster should use the Advanced Azure CNI with network policy enforcement by Calico. See the
below network_profile configuration example.
network_profile {
  load_balancer_sku  = "Standard"
  network_plugin     = "azure"
  network_policy     = "calico"
  dns_service_ip     = "100.97.0.10"
  docker_bridge_cidr = "172.17.0.1/16"
  service_cidr       = "100.97.0.0/16"
}
AKS clusters come equipped with several kubernetes.io/azure-disk backed storage classes by default.
Domino requires use of premium disks for adequate input and output performance. The managed-premium class
that is created by default can be used. Consult the following storage class specification as an example.
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
  selfLink: /apis/storage.k8s.io/v1/storageclasses/managed-premium
parameters:
  cachingmode: ReadOnly
  kind: Managed
  storageaccounttype: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate
Domino uses one Azure storage account for both blob data and files. See the configuration below for the two required resources: the storage account itself and a blob container inside the account.
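A minimal Terraform sketch of these two resources (the resource names, account name, location, and replication settings are illustrative assumptions, not values prescribed by this guide):

```hcl
resource "azurerm_storage_account" "domino" {
  name                     = "dominostorageacct"   # must be globally unique
  resource_group_name      = "example_resource_group"
  location                 = "East US"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "domino_blobs" {
  name                  = "domino-blobs"
  storage_account_name  = azurerm_storage_account.domino.name
  container_access_type = "private"
}
```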
Record the names of these resources for use when installing Domino.
Domain
Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name. Record the FQDN for use when installing Domino.
If you’ve applied the configurations described above to your AKS cluster, it should be able to run the Domino cluster
requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed in the
cluster.
See below for an example configuration file for the Domino installer based on the provisioning examples above.
schema: '1.0'
name: domino-deployment
version: 4.1.9
hostname: domino.example.org
pod_cidr: '100.97.0.0/16'
ssl_enabled: true
ssl_redirect: true
request_resources: true
enable_network_policies: true
enable_pod_security_policies: true
create_restricted_pod_security_policy: true
namespaces:
  platform:
    name: domino-platform
    annotations: {}
    labels:
      domino-platform: 'true'
  compute:
    name: domino-compute
    annotations: {}
    labels: {}
  system:
    name: domino-system
    annotations: {}
    labels: {}
ingress_controller:
  create: true
  gke_cluster_uuid: ''
storage_classes:
  block:
    create: false
    name: managed-premium
    type: azure-disk
    access_modes:
      - ReadWriteOnce
    base_path: ''
    default: false
Starting with Domino 4.3.1, the Domino platform can run on the OpenShift Container Platform (OCP) and OpenShift Kubernetes Engine (OKE). Domino supports OCP/OKE versions 4.4+.
This section describes how to configure an OpenShift Kubernetes Engine cluster for use with Domino.
Namespaces
No namespace configuration is necessary prior to install. Domino will create three namespaces in the cluster during
installation, according to the following specifications:
Namespace Contains
platform Durable Domino application, metadata, platform services required for platform operation
compute Ephemeral Domino execution pods launched by user actions in the application
domino-system Domino installation metadata and secrets
Node pools
The OpenShift cluster must have worker nodes with the following specifications and distinct node labels, and it may
include an optional GPU pool:
More generally, the platform worker nodes need an aggregate minimum of 24 CPUs and 96 GB of memory. Spreading the resources across multiple nodes with proper failure isolation (e.g. availability zones) is recommended.
Managing nodes and node pools in OpenShift is done through Machine Management and the Machine API. For each node pool above, you will need to create a MachineSet. Be sure to provide the Domino required labels in the Machine spec (spec.template.spec.metadata.labels stanza). Also, update any provider spec per your infrastructure provider of choice and sizing (spec.template.spec.providerSpec stanza); for example, in AWS, updates may include, but are not limited to: AMI ID, block device storage sizing, and availability zone placement.
The following is an example MachineSet for the platform node pool:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
labels:
machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
The following is an example MachineSet for the default (compute) node pool:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
  name: firestorm-dxcpd-default-us-west-1a
  namespace: openshift-machine-api
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
      machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-default-us-west-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: firestorm-dxcpd
        machine.openshift.io/cluster-api-machine-role: default
        machine.openshift.io/cluster-api-machine-type: default
        machine.openshift.io/cluster-api-machineset: firestorm-dxcpd-default-us-west-1a
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/default: ""
          dominodatalab.com/node-pool: default
          domino/build-node: "true"
      providerSpec:
        value:
          ami:
            id: ami-02b6556210798d665
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
            - ebs:
                iops: 0
                volumeSize: 400
                volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: firestorm-dxcpd-worker-profile
          instanceType: m5.2xlarge
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: us-west-1a
            region: us-west-1
          publicIp: null
          securityGroups:
            - filters:
                - name: tag:Name
                  values:
Node Autoscaling
For clusters on top of an elastic cloud provider, node autoscaling (or Machine autoscaling) is achieved by creating ClusterAutoscaler and MachineAutoscaler resources.
The following is an example ClusterAutoscaler:
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 20
    cores:
      min: 8
      max: 256
    memory:
      min: 4
      max: 256
    gpus:
      - type: nvidia.com/gpu
        min: 0
        max: 16
      - type: amd.com/gpu
        min: 0
        max: 4
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    unneededTime: 10m
The following is an example MachineAutoscaler for the MachineSet created for the default node pool:
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "firestorm-dxcpd-default-us-west-1a"
  namespace: "openshift-machine-api"
spec:
  # Replica bounds are illustrative; scaleTargetRef points at the MachineSet above
  minReplicas: 0
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: firestorm-dxcpd-default-us-west-1a
Storage
Networking
Domain
Domino will need to be configured to serve from a specific FQDN. To serve Domino securely over HTTPS, you will
also need an SSL certificate that covers the chosen name.
Network Plugin
Domino relies on Kubernetes network policies to manage secure communication between pods in the cluster. By default, OpenShift uses the Cluster Network Operator to deploy the OpenShift SDN default CNI network provider plugin, which supports network policies and hence should work without further configuration.
Ingress
Domino uses the NGINX ingress controller maintained by the Kubernetes project alongside (it does not replace) the HAProxy-based ingress controller implemented by OpenShift, and deploys the ingress controller as a node port service. By default, the ingress listens on node ports 443 (HTTPS) and 80 (HTTP).
Load Balancer
A load balancer should be set up to serve your DNS name. For example, in AWS, you will need to set up DNS so it points a CNAME at an Elastic Load Balancer.
After you complete the installation process, you must configure the load balancer to balance across the platform nodes
at the ports specified by your ingress.
External Resources
If you plan to connect your cluster to other resources like data sources or authentication services, pods running on the
cluster should have network connectivity to those resources.
Container Registry
Domino deploys its own container image registry instead of using the OpenShift built-in container image registry. During installation, the OpenShift cluster image configuration is modified to trust the Domino certificate authority (CA). This ensures that OpenShift can run pods using Domino's custom-built images. In the images.config.openshift.io/cluster resource, you can find a reference to a ConfigMap that contains the Domino CA.
spec:
  additionalTrustedCA:
    name: domino-deployment-registry-config
If you’ve applied the configurations described above to your OpenShift cluster, it should be able to run the Domino
cluster requirements checker without errors. If the checker runs successfully, you are ready for Domino to be installed
in the cluster.
NVIDIA DGX systems can run Domino workloads if they are added to your Kubernetes cluster as compute (worker) nodes. Read below for how to set up and add DGXes to Domino.
The flow chart begins from the top left, with a Domino end user requesting a GPU tier.
If a DGX is already configured for use in Domino’s Compute Grid, the Domino platform administrator can define a
GPU-enabled Hardware Tier from within the Admin console.
The middle lane of the flow chart outlines the steps required to integrate a provisioned DGX system as a node in the
Kubernetes cluster that is hosting Domino, and subsequently configure that node as a GPU-enabled component of
Domino’s compute grid.
The bottom swim lane outlines that, to leverage an Nvidia DGX system with Domino, it must be purchased and provisioned into the target infrastructure stack hosting Domino.
Nvidia DGX systems can be purchased through Nvidia's Partner Network. Install the DGX system in a hosting environment with network access to the additional host and storage infrastructure required to host Domino.
52 Chapter 3. Kubernetes
Domino Admin Docs Documentation, Release 4.4.0
If this is a new (greenfield) deployment of Domino, one must first install & configure a Kubernetes cluster that meets
Domino’s Cluster Requirements, including valid configuration of your Kubernetes’ network policies to support secure
communication between pods that will host Domino’s platform services and compute grid.
Adding a DGX to an existing Domino deployment is as simple as adding the DGX to your Kubernetes API server as a worker node, with a node label consistent with your chosen naming conventions. The default node label for GPU-based worker nodes is ‘default-gpu’.
Additionally, proper taints must be added to your DGX node. This facilitates the selection of the DGX for GPU-based
workloads running on Domino.
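For illustration, the label and a taint might be applied as follows (the node name, and the taint key and effect, are assumptions to adapt to your own conventions and hardware tier configuration):

```
kubectl label node <dgx-node-name> dominodatalab.com/node-pool=default-gpu
kubectl taint node <dgx-node-name> nvidia.com/gpu=true:NoSchedule
```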
Configuring a Domino Hardware Tier to leverage your configured DGX Compute Node
Now that the DGX is added to your API server and labeled properly, we can move on to configuration of Domino
Hardware Tiers from within Domino’s Admin UI.
Domino provides governance features from within this interface, supporting LDAP/AD federation or SSO-based attributes for managed access control and user execution quotas. We have also published a series of best practices for managing hardware tiers in your compute grid.
Nvidia Driver
Configuration of the Nvidia driver at the host level should be performed by your server administrator. The correct Nvidia driver for your host can be identified using the configuration guide found here. More information can be found in the DGX Systems Documentation.
CUDA Version
The CUDA software version required for a given development framework, such as Tensorflow, will be documented
on their website. For example, Tensorflow >=2.1 requires CUDA 10.1 and some additional software packages, e.g.,
CuDNN.
CUDA & Nvidia Driver Compatibility
Once the correct CUDA version is identified for your specific needs, one must consult the CUDA-Nvidia Driver
Compatibility Table.
In the Tensorflow 2.1 example, the CUDA 10.1 requirement means one must be running CUDA >=10.1 and Nvidia
driver >=410.48 on the host. Table 1 in the link above will guide your choice of matching CUDA & Nvidia driver
versions.
Subsequently, the Domino Compute Environment must be configured to leverage the exact CUDA version that corresponds to the desired application.
Simplifying this constraint, note that CUDA drivers provide backwards compatibility: the CUDA version on the host can be greater than or equal to that which is specified in your Compute Environment.
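This host-versus-environment constraint can be expressed as a simple version check (a sketch; the function name and version numbers are illustrative):

```python
def cuda_compatible(host_cuda: str, env_cuda: str) -> bool:
    """True if the host CUDA version satisfies the Compute Environment's
    requirement, i.e. host version >= environment version."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(host_cuda) >= parse(env_cuda)

print(cuda_compatible("10.2", "10.1"))  # host newer than environment -> True
print(cuda_compatible("10.0", "10.1"))  # host older than environment -> False
```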
Because the CUDA software installation process often returns unexpected results when attempting to install an exact CUDA version (including patch version), the fastest route to a functioning configuration is typically to install the latest available minor release of your required major version of CUDA, and then create a Docker environment variable (ENV) within your Compute Environment that constrains compatible sets of CUDA, GPU generations, and Nvidia drivers.
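As an example of this approach, a Compute Environment Dockerfile fragment for the CUDA 10.1 case might look like the following (the package name assumes NVIDIA's apt repository is configured; package and path names are assumptions, not Domino-specific values):

```dockerfile
# Install the latest available 10.1 release rather than pinning an exact patch version
RUN apt-get update && apt-get install -y --no-install-recommends cuda-toolkit-10-1 \
    && rm -rf /var/lib/apt/lists/*
# Record the CUDA location so frameworks resolve the intended version consistently
ENV CUDA_HOME=/usr/local/cuda-10.1
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```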
Need Additional Assistance?
Please consult your Domino customer success engineer for guidance on your specific needs. Domino can provide sample configurations that will simplify your configuration process.
1. Build Node
We recommend you do not use a DGX GPU as a build node for environments. Instead, opt for a CPU resource
as part of your overall Domino architecture.
2. Splitting GPUs per Tier
We recommend providing several GPU tiers with different numbers of GPUs in each tier (e.g. 1-, 2-, 4-, and 8-GPU hardware tiers), as different training jobs can make use of single or parallel GPUs, and consuming a whole DGX box for one workload may not be feasible in your environment.
3. Governance
After splitting up hardware tiers, access can be global or, alternatively, limited to specific organizations. We recommend ensuring that the right organizations have GPU Hardware Tier access (or are restricted) to ensure availability for critical work and/or to prevent the unauthorized use of GPU tiers.
In the context of Kubernetes and Domino, multi-tenancy means a Kubernetes cluster (hereinafter simply referred to as “cluster” unless otherwise disambiguated) that supports multiple applications and is not dedicated just to Domino (i.e. each application is an individual cluster tenant). Domino supports multi-tenant clusters (or multi-tenancy) by adhering to a set of principles that ensure it does not interfere with other applications or other cluster-wide services that may exist. This also applies to the installation of Domino into a multi-tenant cluster, assuming typical best-practice multi-tenancy constraints.
• On-Premise and Capacity Constrained Environments. In this case, you are trying to maximize the utilization
of limited, often physical, infrastructure.
• Minimize Administration Costs.
• Shared Resource Loading. Multi-tenant clusters still share common resources, such as the Kubernetes control plane (e.g. API server), DNS, and ingress. As a result, other applications can impact Domino and vice versa.
• Imperfect Compute Isolation and Predictability. Unless you restrict node-level usage for applications, there is no isolation at the node level. Hence, Domino Runs will potentially share compute with other applications. Ill-behaved tenants could impact Domino Runs by hogging resources, causing drops in the resources available to Domino or, in the worst case, bringing down the node. In most cases this will probably not happen; however, if particular Domino Runs need predictability or strict isolation, this may be an issue. You can reserve nodes just for the Domino application in your cluster, but this undercuts the argument for multi-tenancy.
• Increased Security Complexity and Risk. Cluster administrators will likely have to manage a larger, or finer-grained, set of RBAC objects and rules. Shared resources and node-level coupling expose an additional attack surface for any malicious tenants.
• Shared Cluster Maintenance. Any cluster maintenance will cause all applications to be subject to the same maintenance window. Hence, if the cluster maintenance is due to a particular application, all applications will be subjected to the same downtime even though they do not require that maintenance.
Note: Given the risks and data science workload profile, we highly recommend that where possible Domino be
deployed in its own Kubernetes cluster for enterprise and production scenarios.
Files
If two or more applications attempt to map a file from the “host path” and read or modify that file, problems can arise. The use of host paths is discouraged except for monitoring software; currently, the only place Domino requires a host mount is for fluentd to monitor container logs. As this is standard practice for fluentd and an explicitly read-only operation, Domino will not interfere with other applications.
System Settings
Applications that require system settings be modified for performance or reliability can interfere with or overwrite
other applications’ settings.
Elasticsearch
Currently, the only service that requires an updated system setting for Domino is Elasticsearch, and this can be disabled if the cluster operators already have an acceptable setting. vm.max_map_count must be set for Elasticsearch to work; this is not a Domino requirement, but a mandatory requirement from the upstream Elasticsearch Helm chart.
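For reference, the Elasticsearch documentation calls for vm.max_map_count of at least 262144, typically set on each host like so (requires root):

```
sysctl -w vm.max_map_count=262144
```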
GPU Support
We deploy a number of services in order to properly expose GPUs for Domino. In a multi-tenant environment,
we would generally ask cluster administrators to manage these themselves, and we can disable our services via our
installer.
DaemonSets
Non-Namespaced Resources
ClusterRoles
Domino creates separate namespaces for its services and requires communication between these namespaces. Domino creates a number of ClusterRoles and bindings that control access to its namespaces or to global resources. As of Domino 4.2, all Domino-created ClusterRoles are prefixed by the deployment name, which is specified by the name key in the domino.yml configuration file (see Configuration Reference).
Domino uses pod security policies (PSPs) to ensure that, by default, pods cannot use system-level permissions that they have not been granted. Because PSPs are cluster-scoped, they too have been prefixed with the deployment name. Applications cannot use these PSPs without explicitly being granted access through a Role or ClusterRole.
Domino does not make extensive use of Custom Resource Definitions (CRDs) except for the on-demand spark feature
in 4.x. Our CRD is named uniquely, sparkclusters.apps.dominodatalab.com and should not interfere
with other applications.
Persistent Volumes
Domino uses persistent volumes extensively throughout the system to ensure that data storage is abstracted and permanent. With the exception of two shared storage mounts, which both incorporate namespaces to ensure uniqueness, Domino strictly uses dynamic volume creation through persistent volume claims, which dynamically allocate names that will not conflict with any other application's.
3.8.5 Recommendations
• Separate Node Pools for Platform and Compute. Even if Domino is installed in a multi-tenant cluster, we prefer to have separate node pools for our Platform and Compute nodes. This is not always possible; as a compromise, Domino does set resource limits and requests so that it cannot overwhelm individual nodes.
Intra-cluster encryption in transit is implemented via a deployed service mesh, specifically Istio. At installation time,
Domino can deploy Istio for Domino use only, or Domino can be configured to leverage an existing deployed Istio
on the Kubernetes cluster (potentially shared with other applications). See Installation Configuration Reference for
details.
Out of the box, Istio provides scalable identity and X.509 certificate management for use with mTLS encryption,
including periodic certificate and key rotation. Because all encrypted communication is internal, these certificates are
not exposed or required for communication to any external services, such as web browsers and clients.
We do understand that certain enterprise policies mandate the use of corporate public key infrastructure (PKI) and
necessitate the use of certificate authority (CA) certificates.
Note: All certificates must be X.509 PEM format and keys must be passwordless.
Filename Description
root-cert.pem Root CA certificate for PKI.
ca-cert.pem Intermediate CA certificate from root CA. This is the Istio CA certificate.
ca-key.pem Private key for Istio CA certificate.
cert-chain.pem Full chain from ca-cert.pem to root-cert.pem (including both certificates).
A standard installation following the install process with the fleetcommand-agent (Domino installer) will automatically pick up the created Secret, and Istio will use the custom CA certificates.
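The Secret referenced above can be created from the four files in the table. This is a sketch; the Secret name cacerts and the istio-system namespace follow the standard Istio plugin-CA convention, which the update procedure later in this section also uses:

```shell
# Create the CA Secret that Istio (and the Domino installer) will pick up
kubectl -n istio-system create secret generic cacerts \
  --from-file=ca-cert.pem \
  --from-file=ca-key.pem \
  --from-file=root-cert.pem \
  --from-file=cert-chain.pem
```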
This section describes how to update the custom CA certificate used by Istio for intra-cluster encryption in transit.
There are two scenarios:
1. No changes to the private key and common name
This assumes only ca-cert.pem is updated.
2. Updates to the private key, common name, or upstream certificates
Any of the certificate files have changed, including any upstream intermediate certificates.
In both cases, you need to create a new full chain certificate file (cert-chain.pem).
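Building the full chain is a concatenation in issuing order: the intermediate certificate first, then the root. The sketch below uses placeholder files for demonstration; in practice ca-cert.pem and root-cert.pem are your real certificates.

```shell
# Placeholder certificates for demonstration only
printf -- '-----BEGIN CERTIFICATE-----\n(intermediate)\n-----END CERTIFICATE-----\n' > ca-cert.pem
printf -- '-----BEGIN CERTIFICATE-----\n(root)\n-----END CERTIFICATE-----\n' > root-cert.pem

# Intermediate first, root second
cat ca-cert.pem root-cert.pem > cert-chain.pem

# Sanity check: the chain should contain one entry per certificate
grep -c -- '-----BEGIN CERTIFICATE-----' cert-chain.pem
```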
Tip: We recommend backing up existing certificates and keys before updating new ones.
The procedure to update the custom CA certificates is to create a Secret with the new files and restart the Istio daemon (istiod).
# Delete existing secret with CA cert files
kubectl -n istio-system delete secret cacerts
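Continuing the procedure, a sketch of recreating the Secret and restarting the Istio daemon. The istiod deployment name is the Istio default and is an assumption here:

```shell
# Recreate the Secret with the updated certificate files
kubectl -n istio-system create secret generic cacerts \
  --from-file=ca-cert.pem --from-file=ca-key.pem \
  --from-file=root-cert.pem --from-file=cert-chain.pem

# Restart the Istio daemon so it reloads the CA material
kubectl -n istio-system rollout restart deployment istiod
```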
If changes have been made or are needed to the private key, common name (CN), or upstream certificates, a full restart is required in addition to creating a new Secret with the new files and restarting the Istio daemon as in the previous section.
3.10 Compatibility
Domino has been tested and verified to run on the following types of clusters:
Vendor / Partner
• Rancher
If you have a cluster from another provider, you can check for compatibility by running the Domino cluster requirements checker. If you have questions about cluster compatibility, contact Domino.
CHAPTER 4
Installation
The Domino platform runs on Kubernetes. To simplify deployment and configuration of Domino services, Domino
provides an install automation tool called the fleetcommand-agent that uses Helm to deploy Domino into your
compatible cluster. The fleetcommand-agent is a Python application delivered in a Docker container, and can
be run locally or as a job inside the target cluster.
4.1.1 Requirements
The install automation tools are delivered as a Docker image, and need to run on an installation workstation that meets
the following requirements:
• Docker installed
• Kubectl service account access to the cluster
• Access to download and install Helm via package manager or GitHub
• Access to quay.io to download the installer image
Additionally, you will need credentials for an installation service account that can access the Domino upstream image
repositories in quay.io. Throughout these instructions, these credentials will be referred to as $QUAY_USERNAME and
$QUAY_PASSWORD. Contact your Domino account team if you need new credentials.
The fleetcommand-agent needs access to two types of assets to install Domino:
1. Docker images for Domino components
2. Helm charts
The hosting cluster will need access to the following domains via Internet to retrieve component and dependency
images for online installation:
• quay.io
• domino.tech
• k8s.gcr.io
• docker.elastic.co
• docker.io
• gcr.io
Alternatively, you can configure the fleetcommand-agent to point to a private docker registry and application
registry for offline installation.
1. Log in to quay.io with the credentials described in the requirements section above.
2. Find the image URI for the version of the fleetcommand-agent you want to use from the release notes.
3. Pull the image to your local machine.
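The steps above can be sketched as follows; the tag v34 is illustrative (taken from the release notes later in this chapter), so substitute the image URI for the version you need:

```shell
# Authenticate to quay.io with the installer service account credentials
echo "$QUAY_PASSWORD" | docker login quay.io --username "$QUAY_USERNAME" --password-stdin

# Pull the installer image to the local machine
docker pull quay.io/domino/fleetcommand-agent:v34
```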
"Entrypoint": [
"python",
"-m",
"fleetcommand_agent"
]
This launches the Python application inside the container at /app/fleetcommand_agent. This allows you to
easily run agent commands via docker run like this:
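A minimal invocation might look like the sketch below; the tag v34 is illustrative, and the working directory is mounted so the agent can read and write configuration files:

```shell
# Arguments after the image name are passed to the fleetcommand_agent entrypoint
docker run --rm -v $(pwd):/install \
  quay.io/domino/fleetcommand-agent:v34 \
  init --file /install/domino.yml
```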
init
• --image-registry
Provide a registry URI to prepend to Domino images to set up the template for installation from a private Docker
registry. Should be used in conjunction with --full.
Example:
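A sketch of init with a private registry, following the same pattern as the offline installation instructions later in this chapter (registry URL and tag are illustrative):

```shell
docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 \
  init --image-registry your-registry-url.domain:port --full --file /install/domino.yml
```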
run
Installs Domino into a cluster specified by a Kubernetes configuration from the KUBECONFIG environment variable.
A valid configuration file must be passed in to this command.
Arguments:
• --file -f
File system path to the complete and valid configuration file.
• --kubeconfig
Path to Kubernetes configuration file containing cluster and authentication information to use.
• --dry
Use this mode to avoid making any permanent changes to the target cluster. A dry run checks service account permissions and generates detailed logs about the charts to be deployed with the given configuration. The output is written to /app/logs and /app/.appr_chart_cache inside the container.
Note that this option requires that the namespaces you want to use already exist, and for Helm 2 there must be an accessible Tiller.
Example:
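A sketch of the run command, mounting both the configuration file and a kubeconfig into the container (tag and paths are illustrative):

```shell
docker run --rm \
  -v $(pwd):/install \
  -v $HOME/.kube/config:/kubeconfig \
  quay.io/domino/fleetcommand-agent:v34 \
  run --file /install/domino.yml --kubeconfig /kubeconfig
```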
destroy
Removes all resources from the target cluster for a given configuration file.
Arguments:
• --file -f
File system path to the complete and valid configuration file.
• --kubeconfig
Path to Kubernetes configuration file containing cluster and authentication information to use.
• --dry
Use this mode to avoid making any permanent changes to the target cluster. A dry run checks service account permissions and generates detailed logs about the charts to be deployed with the given configuration.
Example:
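A sketch of the destroy command, following the same invocation pattern as run (tag and paths are illustrative; this removes all Domino resources described by the configuration):

```shell
docker run --rm \
  -v $(pwd):/install \
  -v $HOME/.kube/config:/kubeconfig \
  quay.io/domino/fleetcommand-agent:v34 \
  destroy --file /install/domino.yml --kubeconfig /kubeconfig
```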
1. Connect to a workstation that meets the install automation requirements listed above.
2. Log in to quay.io with the credentials described in the requirements section above.
3. Pull the installer image to your workstation.
4. Initialize the installer application to generate a template configuration file named domino.yml.
5. Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. Read the configuration reference for more information about available keys, and consult the configuration examples for guidance on getting started.
Note that you should change the value of name from domino-deployment to something that identifies the
purpose of your installation and contains the name of your organization.
6. Run this install script from the directory with the finalized configuration file to install Domino into the cluster.
Note that you must fill in your $QUAY_USERNAME and $QUAY_PASSWORD where indicated, and also note that
this script assumes your installer configuration file is in the same directory, and is named exactly domino.yml.
#!/bin/bash
set -ex
set +e
while true; do
sleep 5
if kubectl logs -f fleetcommand-agent-install; then
break
fi
done
7. The installation process can take up to 30 minutes to fully complete. The installer will output verbose logs and
surface any errors it encounters, but it can also be useful to follow along in another terminal tab by running:
This will show the status of all pods being created by the installation process. If you see any pods enter a crash
loop or hang in a non-ready state, you can get logs from that pod by running:
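The monitoring commands were presumably kubectl invocations along these lines; the namespace name is taken from the configuration reference, and <pod-name> is a placeholder:

```shell
# Watch the pods created by the installation as they come up
kubectl get pods -n domino-platform --watch

# Fetch logs from a pod that is crash-looping or stuck in a non-ready state
kubectl logs <pod-name> -n domino-platform
```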
If the installation completes successfully, you should see a success message reporting the FQDN where Domino is available.
However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for
the name to point to an ingress load balancer with the appropriate SSL certificate that forwards traffic to your
platform nodes.
4.1.5 Upgrading
Upgrading a Domino deployment is a simple process of running the installer again with the same configuration, but with the version field set to the value of the desired upgrade version. See the installer configuration reference and the installer release notes for information on the Domino versions your installer can support.
If you need to upgrade to a newer installer version to upgrade to your desired Domino version, use the process below.
1. Retrieve the new Domino installer image from quay.io by filling in the desired <version> value in the command below.
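The retrieval command is presumably a docker pull with the version filled in:

```shell
# Substitute the desired installer version, e.g. v34
docker pull quay.io/domino/fleetcommand-agent:<version>
```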
2. Move your existing domino.yml configuration file to another directory, or rename it.
3. Generate a new domino.yml configuration template by running the initialization command through the new
version of the installer. This will ensure you have a configuration schema conformant to the new version.
4. Copy the values from your old configuration into the new file.
5. When complete, run the install script from the install process, being sure to change the spec.containers.image value to quay.io/domino/fleetcommand-agent:<version> with the appropriate version.
4.2.1 Istio
This section configures whether and how an Istio service mesh is deployed by Domino or integrated with an existing one. A Domino-deployed Istio is for Domino use only. These settings should only be installed and/or enabled if intra-cluster encryption in transit is required.
This section configures the NGINX ingress controller deployed by the fleetcommand-agent.
4.2.3 Namespaces
Namespaces are a way to virtually segment Kubernetes executions. Domino will create namespaces according to the
specifications in this section, and the installer requires that these namespaces not already exist at installation time.
Storage Classes are a way to abstract the dynamic provisioning of volumes in Kubernetes.
Domino requires two storage classes:
1. block storage for Domino services and user executions that need fast I/O
2. shared storage that can be shared between multiple executions
Domino supports pre-created storage classes. Alternatively, the installer can create a shared storage class backed by NFS or a cloud NFS analog, as long as the cluster can access the NFS system for read and write, and it can create several types of block storage classes backed by cloud block storage systems like Amazon EBS.
storage_classes.block.base_path
    Base path to use on nodes as a base when using hostpath volumes
storage_classes.block.default (required)
    Whether to set this storage class as the default (true / false)
storage_classes.shared.efs.region
    EFS store AWS region, e.g. us-west-2
storage_classes.shared.efs.filesystem_id
    EFS filesystem ID, e.g. fs-7a535bd1
storage_classes.shared.nfs.server
    NFS server IP or hostname
storage_classes.shared.nfs.mount_path
    Base path to use on the server when creating shared storage volumes
storage_classes.shared.nfs.mount_options
    YAML list of additional NFS mount options, e.g. - mfsymlinks
storage_classes.shared.azure_file.storage_account
    Azure storage account to create filestores
Domino can store long-term, unstructured data in "blob storage" buckets. Currently, only the shared storage class described above (NFS) and S3 are supported.
To apply a default S3 bucket or shared storage type to all use-cases of blob storage, it is only necessary to fill out the
default setting and make sure enabled is true. Otherwise, all other blob storage uses (projects, logs, and
backups) should be filled out.
4.2.6 Autoscaler
For Kubernetes clusters without native cluster scaling in response to new user executions, Domino supports the use of
the cluster autoscaler.
AWS Auto-Discovery
The cluster autoscaler supports autodiscovery on AWS. Without any explicit configuration of specific autoscaling
groups, it will detect all ASGs that have the appropriate tags and refresh them if their settings are updated directly.
This means listing all ASGs with accurate min/max settings (or listing them at all) is not required as referenced below
in the Groups section. ASG settings can be updated directly in AWS without having to update the cluster-autoscaler
configuration or rerun the installer.
By default, if no autoscaler.groups and autoscaler.auto_discovery.tags are specified, the cluster_name will be used to
look for the following AWS tags:
• k8s.io/cluster-autoscaler/enabled
• k8s.io/cluster-autoscaler/{{ cluster_name }}
The tags setting can be used to explicitly specify which resource tags the autoscaler service should look for.
If you would like to disable auto-discovery and continue using specific groups, ensure that auto_discovery.cluster_name is an empty value.
Groups
Autoscaling groups are not dynamically discovered. Each autoscaling group must be individually specified including
the minimum and maximum scaling size.
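A sketch of what a group entry might look like. The key names here are assumptions, not taken from this document; consult the configuration reference for the authoritative schema:

```yaml
autoscaler:
  enabled: true
  groups:
    - name: compute-node-asg   # autoscaling group identifier (assumed key)
      min_size: 0              # minimum scaling size (assumed key)
      max_size: 10             # maximum scaling size (assumed key)
```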
Domino can automatically configure your cloud DNS provider. More extensive documentation can be found on the
external-dns homepage.
Domino supports SMTP for sending email notifications in response to user actions and run results.
4.2.9 Monitoring
Domino supports in-cluster monitoring with Prometheus as well as more detailed, external monitoring through
NewRelic APM and Infrastructure.
monitoring.newrelic.infrastructure (required)
    Enable NewRelic Infrastructure (true / false)
monitoring.newrelic.license_key
    NewRelic account license key
4.2.10 Helm
private_docker_registry.username (required)
    Docker registry username
private_docker_registry.password (required)
    Docker registry password
This section holds the recommended configuration for the internal Docker registry deployed with Domino. Override values allow the registry to use S3, GCS, or Azure blob store as a backend store. GCS requires that a service account already be bound into the Kubernetes cluster, with configuration to ensure the docker-registry service account is properly mapped.
4.2.13 Telemetry
4.2.14 GPU
If using GPU compute nodes, enable the following configuration setting to install the required components.
4.2.15 Fleetcommand
Domino supports upgrading minor patches through an internal tool named Fleetcommand.
Domino will by default deploy some DaemonSets on all available nodes in the hosting cluster. When running in a
multi-tenant Kubernetes cluster, where some nodes are available that should not be used by Domino, you can label
nodes for Domino with a single, consistent label, then provide that label to the fleetcommand-agent with the below
configuration to apply a selector to all Domino resources for that label.
Example
global_node_selectors:
domino-owned: "true"
This example would apply a selector for domino-owned=true to all Domino deployment resources.
The name of the Domino Ingress class can be changed with this setting. This should generally not need to change.
These settings control the Domino image caching service, which runs as a privileged pod and uses the host Docker
socket to pre-pull popular Domino environment images onto compute workers. It can be disabled if desired.
schema: '1.0'
name: $YOUR_ORGANIZATION_NAME
version: 4.3.3
hostname: $YOUR_DESIRED_APPLICATION_HOSTNAME
pod_cidr: '$YOUR_POD_CIDR'
ssl_enabled: true
ssl_redirect: true
request_resources: true
enable_network_policies: true
enable_pod_security_policies: true
global_node_selectors: {}
create_restricted_pod_security_policy: true
kubernetes_distribution: cncf
istio:
enabled: false
install: false
cni: true
namespace: istio-system
namespaces:
platform:
name: domino-platform
annotations: {}
labels:
domino-platform: 'true'
compute:
name: domino-compute
annotations: {}
labels:
domino-compute: 'true'
system:
name: domino-system
annotations: {}
labels: {}
ingress_controller:
create: true
gke_cluster_uuid: ''
class_name: nginx
storage_classes:
block:
create: true
name: dominodisk
Domino provides bundles of offline installation media for use when running the fleetcommand-agent without
Internet access to upstream sources of images and charts. To serve these resources, you must have a Docker registry
accessible to your cluster.
4.4.1 Downloading
You can find URLs of available offline installation bundles in the fleetcommand-agent release notes. These bundles can be downloaded via cURL with basic authentication. Contact your Domino account team for credentials. Note that there is one file required: a versioned collection of images.
Example download:
The images bundle is a .tar archive that must be extracted before being used.
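A sketch of extracting the bundle and loading its images into the local Docker daemon. The per-image .tar layout inside the archive is an assumption; loaded images must then be retagged and pushed to your private registry:

```shell
# Extract the offline bundle (filename from the v34 release notes)
mkdir -p images
tar -xf opsless-v34-docker-images-4.4.0.tar -C images

# Load each extracted image archive into the local Docker daemon
for f in images/*.tar; do
  docker load -i "$f"
done
```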
Once images have been loaded into your private registry, you're ready to install Domino.
4.4.3 Installing
To install Domino using a custom registry, the image references must be modified to point at your private registry. Use the --image-registry argument on the init command to modify all image references to the private registry.
docker run --rm -v $(pwd):/install quay.io/domino/fleetcommand-agent:v34 \
init --image-registry your-registry-url.domain:port --full --file /install/domino.yml
If your registry requires authentication, ensure the private_docker_registry section of your installer config-
uration is filled in with the correct credentials:
private_docker_registry:
server: your-registry-url.domain:port
username: '<username>'
password: '<password>'
Helm 3
Charts come pre-packaged within the fleetcommand-agent image. Set up the helm object in configuration to match the following:
helm:
version: 3
host: gcr.io
namespace: domino-eng-service-artifacts
prefix: ''
username: ''
insecure: false
cache_path: '/app/charts'
Note that the http protocol before the hostname in this configuration is important. Once these changes have been
made to your installer configuration file, you can run the fleetcommand-agent to install Domino.
4.4.4 Configuration
When performing offline installations there are 3 main central configuration keys that need to be repointed to the
private registry hosting the referenced images. From the Domino landing page, click Admin in the main menu. Then
in the administration portal, click Advanced > Central Config. Use the Add Record button at top right to add the
following records:
Key                                                             Value
com.cerebro.domino.builder.image                                IMAGE_URI of the latest domino/builder-job
com.cerebro.domino.computegrid.kubernetes.executor.imageName    IMAGE_URI of the latest domino/executor
com.cerebro.domino.modelmanager.harnessProxy.image              IMAGE_URI of the latest domino/harness-proxy
Image: quay.io/domino/fleetcommand-agent:v34
Installation bundles:
• 4.4.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v34-docker-images-4.4.0.tar
Changes
• Adds support for Domino 4.4.0
Image: quay.io/domino/fleetcommand-agent:v33
Installation bundles:
teleport_kube_agent:
enabled: false
proxyAddr: teleport.domino.tech:443
authToken: eeceeV4sohh8eew0Oa1aexoTahm3Eiha
• Domino 4.4.0 includes support for restartable workspace disaster recovery in AWS leveraging EBS snapshots. To support this functionality, existing installations may require additional IAM permissions for platform node pool instances.
The permissions required, without any resource restriction (i.e. *), are the following:
– ec2:CreateSnapshot
– ec2:CreateTags
– ec2:DeleteSnapshot
– ec2:DeleteTags
– ec2:DescribeAvailabilityZones
– ec2:DescribeSnapshots
– ec2:DescribeTags
Known Issues
• If upgrading from Helm 2 to Helm 3, please read the release notes from v22 for caveats and known issues.
Image: quay.io/domino/fleetcommand-agent:v32
Installation bundles:
• 4.3.3 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v32-docker-images-4.3.3.tar
Changes:
• Fixes a memory leak in the EFS CSI driver.
Image: quay.io/domino/fleetcommand-agent:v31
Installation bundles:
Image: quay.io/domino/fleetcommand-agent:v30
Installation bundles:
• 4.3.3 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v30-docker-images-4.3.3.tar
Changes:
• Adds support for Domino 4.3.3
• The agent now supports installing Istio 1.7 (set istio.install to true), and installing Domino in Istio-compatible mode (set istio.enabled to true).
istio:
enabled: false
install: false
cni: true
namespace: istio-system
• The EFS storage provider for new installs has changed from efs-provisioner to the EFS CSI driver, in
order to support encryption in transit to EFS. For existing installs, this does not require any changes unless
encryption in transit is desired. If a migration to encrypted EFS is necessary, please contact Domino support.
One limitation of the new driver, compared to the previous, is an inability to dynamically create directories
according to provisioned volumes. Support for pre-provisioned directories in AWS is done through access
points, which must be created before Domino can be installed.
To specify the access point at install time, ensure the filesystem_id is set in the format {EFS ID}::{AP ID}:
storage_classes:
shared:
efs:
filesystem_id: 'fs-285b532d::fsap-00cb72ba8ca35a121'
• Two new fields were added in order to simplify DaemonSet management during upgrades for particularly large clusters. DaemonSets do not have configuration options for upgrades, and pods will be replaced one by one. For large compute node pools, this can take a significant amount of time.
helm:
skip_daemonset_validation: false
daemonset_timeout: 300
Setting helm.skip_daemonset_validation to true will bypass post-upgrade validation that all pods
have been successfully recreated. helm.daemonset_timeout is an integer representing the number of
seconds to wait for all daemon pods in a DaemonSet to be recreated.
• 4.3.3 introduces limited availability of the new containerized Domino image builder: Forge. Forge can be
enabled with the ImageBuilderV2 feature flag, although Domino services must be restarted to cause this
flag to take effect. Running Domino image builds in a cluster that uses a non-Docker container runtime, such as
cri-o or containerd, requires that the feature flag be enabled.
To support the default rootless mode that Forge is configured to use, the worker nodes must support unprivileged
mounts, user namespaces, and overlayfs (either natively or through FUSE). Currently, GKE and EKS do not
support user namespace remapping and require the following extra configuration to properly use Forge.
services:
forge:
chart_values:
config:
fullPrivilege: true
Image: quay.io/domino/fleetcommand-agent:v29
Installation bundles:
• 4.3.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v29-docker-images-4.3.2.tar
Changes:
• Updated Keycloak migration job version.
Image: quay.io/domino/fleetcommand-agent:v28
Installation bundles:
• 4.3.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v28-docker-images-4.3.2.tar
Changes:
• Adds support for Domino 4.3.2
• Adds support for encrypted EFS access by using the EFS CSI driver.
• A new istio field has been added to the domino.yml schema for testing and development of future releases.
Domino 4.3.2 does not support Istio and therefore you must set enabled in this new section to false.
istio:
enabled: false
install: false
cni: true
namespace: istio-system
• New fields to specify static AWS access key and secret key credentials have been added. These are currently
unused and can be left unset.
blob_storage:
projects:
s3:
access_key_id: ''
secret_access_key: ''
• A new field for Teleport remote access integration has been added. This is currently unused and should be set
to false.
teleport:
remote_access: false
Image: quay.io/domino/fleetcommand-agent:v27
Installation bundles:
• 4.3.1 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v27-docker-images-4.3.1.tar
Changes:
• Fix a bug where dry-run installation could cause internal credentials to be improperly rotated.
Image: quay.io/domino/fleetcommand-agent:v26
Installation bundles:
• 4.3.1 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v26-docker-images-4.3.1.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.3.1
• Adds support for running Domino on OpenShift 4.4+.
• A new field has been added to the installer configuration that controls whether or not the image caching service
is deployed.
image_caching:
enabled: true
• A new field has been added to the installer configuration that specifies the Kubernetes distribution for resource
compatibility. The available options are cncf (Cloud Native Computing Foundation) and openshift.
kubernetes_distribution: cncf
Image: quay.io/domino/fleetcommand-agent:v25
Installation bundles:
• 4.3.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-v25-docker-images-4.3.0.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.3.0.
• A new cache_path field has been added to the helm configuration section. Leaving this field blank will
ensure charts are fetched from an upstream repository.
helm:
cache_path: ''
• To facilitate deployment of Domino into clusters with other tenants, a new global node selector field has been
added to the top-level configuration that allows an arbitrary label to be used for scheduling all Domino work-
loads. Its primary purpose is to limit workloads such as DaemonSets that would be scheduled on all available
nodes in the cluster to only nodes with the provided label. Note that this can override default node pool selectors
such as dominodatalab.com/node-pool: "platform", but does not replace them.
global_node_selectors:
domino-owned: "true"
• To facilitate deployment of Domino into clusters with other tenants, a configurable Ingress class has been added
to allow differentiation from other ingress providers in a cluster. If multiple Ingress objects are created with
the default class, it's possible for other tenants' paths to interfere with Domino and vice versa. Generally, this
setting does not need to change, but can be set to any arbitrary string value (such as domino).
ingress_controller:
class_name: nginx
Image: quay.io/domino/fleetcommand-agent:v24
Installation bundles:
Image: quay.io/domino/fleetcommand-agent:v23
Installation bundles:
• 4.2.2 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-docker-images-v23-4.2.2.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.2.2.
• The known issue with v22 around Domino Apps being stopped after upgrade has been resolved. Apps will now
automatically restart after upgrade.
• The known issue with Elasticsearch not upgrading until manually restarted has been resolved. Elasticsearch will
automatically cycle through a rolling upgrade when the deployment is upgraded.
• Fixed an issue that prevented the fleetcommand-agent
• Adds support for autodiscovery of scaling resources by the cluster autoscaler.
Two new fields have been added under the autoscaler.auto_discovery key:
autoscaler:
auto_discovery:
cluster_name: domino
tags: [] # optional. if filled in, cluster_name is ignored.
Known Issues:
• If you’re upgrading from fleetcommand-agent v21 or older, be sure to read the v22 release notes and implement
the Helm configuration changes.
• An incompatibility between how nginx-ingress was initially installed and should be maintained going
forward means that action is required for both Helm 2 and Helm 3 upgrades.
For Helm 2 upgrades, add the following services object to your domino.yml to ensure compatibility:
services:
nginx_ingress:
chart_values:
controller:
metrics:
service:
clusterIP: "-"
service:
clusterIP: "-"
For Helm 3, there are two options. If nginx-ingress has not been configured to provide a cloud-native load
balancer that is tied to the hosting DNS entry, then nginx-ingress can be safely uninstalled prior to the
upgrade. If, however, the load balancer address must be maintained across the upgrade, then the initial upgrade
after the Helm 3 migration will fail. Before retrying the upgrade, execute the following commands.
export NAME=nginx-ingress
export SECRET=$(kubectl get secret -l owner=helm,status=deployed,name=$NAME -n domino-platform | awk '{print $1}' | grep -v NAME)
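The commands above only capture the stale Helm release Secret; presumably it must then be deleted before retrying the upgrade. A sketch (confirm with Domino support before deleting release metadata):

```shell
# Remove the captured release Secret so Helm no longer sees the old release
kubectl -n domino-platform delete secret "$SECRET"
```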
Image: quay.io/domino/fleetcommand-agent:v22
Installation bundles:
• 4.2.0 images: https://mirrors.domino.tech/s3/domino-artifacts/offline/opsless-docker-images-v22-4.2.0.tar
• Latest charts: http://mirrors.domino.tech/artifacts/appr/domino-appr-latest.tar.gz
Changes:
• Adds support for Domino 4.2.
• Adds support for Helm 3
The helm object in the installer configuration has been restructured to accommodate Helm 3 support. There is
now a helm.version property which can be set to 2 or 3. When using Helm 2, the configuration should be
similar to the below example. The username and password will continue to be standard Quay.io credentials
provided by Domino.
helm:
version: 2
host: quay.io
namespace: domino
prefix: helm- # Prefix for the chart repository, defaults to `helm-`
username: "<username>"
password: "<password>"
tiller_image: gcr.io/kubernetes-helm/tiller:v2.16.1 # Version is required and MUST be 2.16.1
insecure: false
When using Helm 3, configure the object as shown below. Helm 3 is a major release of the underlying tool
that powers installation of Domino’s services. Helm 3 removes the Tiller service, which was the server-side
component of Helm 2. This improves the security posture of Domino installation by reducing the scope and
complexity of required RBAC permissions, and it enables namespace isolation of services. Additionally, Helm
3 adds support for storing charts in OCI registries.
Currently, only gcr.io and mirrors.domino.tech are supported as chart repositories. If you are switching to Helm
3, you will need to contact Domino for gcr.io credentials. When using Helm 3, the helm configuration object
should be similar to the below example
helm:
version: 3
host: gcr.io
namespace: domino-eng-service-artifacts
insecure: false
username: _json_key # To support GCR authentication, this must be "_json_key"
password: "<password>"
tiller_image: null # Not required for Helm 3
prefix: '' # Charts are stored without a prefix by default
Migration of an existing Helm 2 installation to Helm 3 is done seamlessly within the installer. Once successful,
Tiller will be removed from the cluster and all Helm 2 configuration is deleted.
Known Issues:
• Elasticsearch is currently configured to only upgrade when the pods are deleted. To properly upgrade an existing
deployment from Elasticsearch 6.5 to 6.8, after running the installer use the rolling upgrade process. This
involves first deleting the elasticsearch-data pods, then the elasticsearch-master pods. See
the example procedure below.
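A sketch of the rolling upgrade, assuming pods follow the names given above; replica counts and exact pod names vary by deployment:

```shell
# Delete the data pods first; the controller recreates them on the new version
kubectl -n domino-platform delete pod elasticsearch-data-0 elasticsearch-data-1

# Then delete the master pods
kubectl -n domino-platform delete pod elasticsearch-master-0 elasticsearch-master-1 elasticsearch-master-2
```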
• An incompatibility between how nginx-ingress was initially installed and should be maintained going
forward means that action is required for both Helm 2 and Helm 3 upgrades.
For Helm 2 upgrades, add the following services object to your domino.yml to ensure compatibility:
services:
nginx_ingress:
chart_values:
controller:
metrics:
service:
clusterIP: "-"
service:
clusterIP: "-"
For Helm 3, there are two options. If nginx-ingress has not been configured to provide a cloud-native load
balancer that is tied to the hosting DNS entry, then nginx-ingress can be safely uninstalled prior to the
upgrade. If, however, the load balancer address must be maintained across the upgrade, then the initial upgrade
after the Helm 3 migration will fail. Before retrying the upgrade, execute the following commands.
export NAME=nginx-ingress
export SECRET=$(kubectl get secret -l owner=helm,status=deployed,name=$NAME -n domino-platform | awk '{print $1}' | grep -v NAME)
• Domino Apps do not currently support a live upgrade from version 4.1 to version 4.2. After the upgrade, all Apps will be stopped.
To restart them, you can use the /v4/modelProducts/restartAll endpoint as in the example below, providing an API key for a system administrator.
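A hedged example using curl; the host is a placeholder and ADMIN_API_KEY stands for a system administrator's API key, passed in Domino's X-Domino-Api-Key header:

```shell
# Placeholder host; replace with your Domino deployment's URL.
DOMINO_HOST="https://domino.example.com"

# Restart all stopped Apps; the API key must belong to a system administrator.
curl -X POST \
  -H "X-Domino-Api-Key: $ADMIN_API_KEY" \
  "$DOMINO_HOST/v4/modelProducts/restartAll"
```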
Image: quay.io/domino/fleetcommand-agent:v21
Changes:
• Adds support for Domino 4.1.10
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique
and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.
Image: quay.io/domino/fleetcommand-agent:v20
Changes:
• Support for 4.1.9 has been updated to reflect a new set of artifacts.
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique
and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.
Image: quay.io/domino/fleetcommand-agent:v19
Changes:
• Added catalogs for Domino up to 4.1.9
• Added support for Docker NO_PROXY configuration. Domino containers will now respect the configuration
and connect to the specified hosts without proxy.
Known issues:
• The deployed version 8.0.1 of Keycloak has an incorrect default First Broker Login authentication flow.
When setting up an SSO integration, you must create a new authentication flow like the one below. Note
that the Automatically Link Account step is a custom flow, and the Create User if Unique
and Automatically Set Existing User executions must be nested under it by adding them with the
Actions link.
Image: quay.io/domino/fleetcommand-agent:v18
Changes:
The following new fields have been added to the fleetcommand-agent installer configuration.
1. Storage class access modes
The storage_class options have a new field called access_modes that allows configuration of the underlying storage class’ allowed access modes.
storage_classes:
  block:
    [snip]
    access_modes:
      - ReadWriteOnce
git:
  storage_class: dominodisk
git:
  storage_class: dominoshared
services:
  git_server:
    chart_values:
      persistence:
        size: 5Ti
Image: quay.io/domino/fleetcommand-agent:v17
Changes:
• Added catalogs for Domino up to 4.1.8
Image: quay.io/domino/fleetcommand-agent:v16
Changes:
• Added catalogs for Domino up to 4.1.7
• Calico CNI is now installed by default for EKS deployments
• AWS Metadata API is blocked by default for Domino version >= 4.1.5
• Added Private registry support in the Installer
• New Install configuration attributes (see the reference documentation for more details):
– sse_kms_key_id option for Blob storage
– gcs option for Google Cloud Storage
– Namespaces now support optional labels to apply labels during installation
– teleport for Domino managed installations only
Image: quay.io/domino/fleetcommand-agent:v15
Changes:
• Added catalog for Domino 4.1.4
• Ensure fleetcommand-agent also deletes system namespace.
• Updated version of Cluster Autoscaler to 1.13.9
Image: quay.io/domino/fleetcommand-agent:v14
Changes:
Configuration
The Central Configuration is where all global settings for a Domino installation are enumerated. You can access the
Central Configuration interface from the Admin portal by clicking Advanced > Central Config.
The interface is organized into a list of records. You can click an existing record to edit its attributes, or add a record with the Add Record button at top right. If there is no record explicitly set for an option, the default value is used. For changes made in the Central Config to take effect, you must restart Domino services using the link at the top of the interface.
These options are related to project visibility settings and are available in namespace common and should be recorded
with no name.
These options are related to email notifications from Domino and are available in namespace common and should be
recorded with no name.
These options are related to Model APIs and are available in namespace common and should be recorded with no
name.
5.1.4 Environments
These options are related to Domino Environments and are available in namespace common and should be recorded
with no name.
5.1.5 Authentication
These options are related to the Keycloak authentication service and are available in namespace common and should
be recorded with no name.
These options are related to long-running workspace sessions and are available in namespace common and should be
recorded with no name.
These options are related to datasets scratch spaces and are available in namespace common and should be recorded
with no name.
These options are related to the compute grid and are available in namespace common and should be recorded with
no name.
These options are related to the on-demand Spark clusters and are available in namespace common and should be
recorded with no name.
These options are related to the file contents download API endpoint and are available in namespace common and
should be recorded with no name.
5.1.11 Builder
5.1.12 Workspaces
5.1.13 Authorization
• Overview
• Setting custom new-user default projects
5.2.1 Overview
By default, every new user in Domino is the owner of a quick-start project. This project is created when the user
signs up, and it contains many useful sample files that show how to take advantage of Domino features, plus a detailed
README.
Admins can replace the default quick-start with one or more customized new-user default projects.
First, create the projects that you want all new users to own a copy of upon signup. These projects should have names, descriptions, and READMEs that make it clear to new users what they’ll find in the project. These projects can be owned by any user; however, they should be Private projects.
Record the username and project name paths for these projects. For example:
admin-user/getting-started-project
admin-user/sample-app-project
Note that the name of the project will be reproduced for new users. If you set the example projects above as default
projects, all new users will own copies at:
new-username/getting-started-project
new-username/sample-app-project
Once your projects are ready for use by new users, set the following central configuration option.
Namespace: common
Key: com.cerebro.domino.frontend.overrideDefaultProject
Value: string of comma separated project paths
For the examples shown above, the value of this setting would be:
admin-user/getting-started-project, admin-user/sample-app-project
As a data science leader, you can define a set of custom project stages that users in Domino can use to label their projects, creating useful views in the Projects Portfolio. These stages can be used to mark a project’s progress through the workflow and life cycle your team uses. To learn more about how users interact with and set project stages, read about stage and status in the projects overview.
To set up the stages that will be available to users in your Domino platform, open the Admin interface, then click
Advanced > Project Stage Configuration.
On the project stage configuration interface, you can click Add Record to create a new stage label that will be available
for Domino users to set on their projects. The record at the top of the list is the default stage all new projects created
in Domino will have, and projects can be changed to any other available stage.
These stages are a custom set of labels that allow your Domino users to communicate progress in a project to their
colleagues and to leadership. It’s up to you as a data science leader to determine the stages that you want available,
and to communicate to your team how they should be used.
Domino recommends setting up a custom default project for new users with information in the README about your team’s practices, available environments, and how users should use project stages.
Domino can integrate with Atlassian Jira to enable users to interact with Jira from inside a Domino project.
This document describes how to link Domino to Jira. Once this configuration is done, users with a Domino account
and a Jira account can link them via OAuth.
5.4.1 Requirements
Domino supports both Jira Cloud and Jira Server version 7.1.6+. For Jira integration to work, an application link needs
to be configured between Domino and Jira.
This process requires system administrator access to Domino and also a Jira account with admin permissions.
5.4.2 Configuration
Step 1:
In the Domino admin UI, under Advanced click Feature Flags. Set ShortLived.JiraIntegrationEnabled
to True.
Step 2:
In the Domino admin UI, under Advanced click Jira Configuration. Provide the URL of your Jira service then click
Add configuration.
You will need the details on this page in subsequent steps. Please note the Public Key, Incoming Consumer
Key and Incoming Consumer Name as these won’t be visible once you move away from this screen.
This step adds the relevant central config values and requires a restart of the Domino services. Click the restart services link.
Step 3:
Log in to your Atlassian/Jira account. Note that you need admin privileges on this account to proceed.
1. Click the gear icon to open settings, then click Products
8. Click Continue and, in the next form, provide the Consumer Name, Consumer Key, and Public Key from Step 2
5.4.3 Dashboard
All projects that have a linked Jira ticket are visible in the Jira Configuration page (Admin > Advanced > Jira Configuration). An admin can unlink any of these projects directly from this screen.
Domino can be reconfigured to use another Jira instance, or the configuration can be deleted, with the following steps:
1. Unlink all Jira-linked projects
2. Go to the Jira Configuration page and delete the current configuration
3. Follow the steps in the Configuration section to link a new connection
Compute
Compute nodes available to run user workloads in Domino are conceptually organized into node pools based on
their Kubernetes labels. The set of pools available to users is referred to as the Domino Compute Grid, and it is the
responsibility of Domino administrators to manage and configure these pools.
6.1.1 Overview
Users in Domino assign their Runs to Domino Hardware Tiers. A hardware tier defines the type of machine a job will
run on, and the resource requests and limits for the pod that the Run will execute in. When configuring a hardware
tier, you will specify the machine type by providing a Kubernetes node label.
You should create a Kubernetes node label for each type of node you want available for compute workloads in Domino,
and apply it consistently to compute nodes that meet that specification. Nodes with the same label become a node pool,
and they will be used as available for Runs assigned to a Hardware Tier that points to their label.
Which pool a Hardware Tier is configured to use is determined by the value in the Node Pool field of the Hardware
Tier editor. In the screenshot below, the large-k8s Hardware Tier is configured to use the default node pool.
The diagram below shows a cluster configured with two node pools for Domino, one named default and one named default-gpu. You can make additional node pools available to Domino by labeling them with the same scheme: dominodatalab.com/node-pool=<node-pool-name>. The arrows in this diagram represent Domino requesting that a node with a given label be assigned to a Run. Kubernetes will then assign the Run to a node in the specified pool that has sufficient resources.
By default, Domino creates a node pool with the label dominodatalab.com/node-pool=default and all
compute nodes Domino creates in cloud environments are assumed to be in this pool. Note that in cloud environments
with automatic node scaling, you will configure scaling components like AWS Auto Scaling Groups or Azure Scale
Sets with these labels to create elastic node pools.
Every Run in Domino is hosted in a Kubernetes pod on a type of node specified by the selected Hardware Tier.
The pod hosting a Domino Run contains three containers:
1. The main Run container where user code is executed
2. An NGINX container for handling web UI requests
3. An executor support container which manages various aspects of the lifecycle of a Domino execution, like
transferring files or syncing changes back to the Domino file system
The amount of compute power required for your Domino cluster will fluctuate over time as users start and stop Runs.
Domino relies on Kubernetes to find space for each execution on existing compute resources. In cloud autoscaling
environments, if there’s not enough CPU or memory to satisfy a given execution request, the Kubernetes cluster
autoscaler will start new compute nodes to fulfill that increased demand. In environments with static nodes, or in
cloud environments where you have reached the autoscaling limit, the execution request will be queued until resources
are available.
Autoscaling Kubernetes clusters will shut nodes down when they are idle for more than a configurable duration. This
reduces your costs by ensuring that nodes are used efficiently, and terminated when not needed.
Cloud autoscaling resources have properties like the minimum and maximum number of nodes they can create. You
should set the node maximum to whatever you are comfortable with given the size of your team and expected volume
of workloads. All else equal, it is better to have a higher limit than a lower one, as nodes are cheap to start up and
shut down, while your data scientists’ time is very valuable. If the cluster cannot scale up any further, your users’
executions will wait in a queue until the cluster can service their request.
The amount of resources Domino will request for a Run is determined by the selected Hardware Tier for the Run.
Each Hardware Tier has five configurable properties that configure the resource requests and limits for Run pods.
• Cores
The number of requested CPUs.
• Cores limit
The maximum number of CPUs. Recommended to be the same as the request.
• Memory
The amount of requested memory.
• Memory limit
The maximum amount of memory. Recommended to be the same as the request.
• Number of GPUs
The number of GPU cards available.
The request values, Cores and Memory, as well as Number of GPUs, are thresholds used to determine whether a node
has capacity to host the pod. These requested resources are effectively reserved for the pod. The limit values control
the amount of resources a pod can use above and beyond the amount requested. If there’s additional headroom on the
node, the pod can use resources up to this limit.
However, if resources are in contention, and a pod is using resources beyond those it requested, and thereby causing
excess demand on a node, the offending pod may be evicted from the node by Kubernetes and the associated Domino
Run is terminated. For this reason, Domino strongly recommends setting the requests and limits to the same values.
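In Kubernetes terms, a hardware tier that follows this recommendation produces a pod resources stanza like the sketch below; the specific values are illustrative, not defaults:

```yaml
# Illustrative only: requests set equal to limits, per the recommendation above.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: 1
```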
To prevent a single user from monopolizing a Domino deployment, an administrator can set a limit on the number of executions that a user can have running concurrently. Once that number is reached for a given user, any additional executions are queued. This limit covers executions for Domino workspaces, jobs, and web applications, as well as any executions that make up an on-demand distributed compute cluster. For example, in the case of an on-demand Spark cluster, an execution slot is consumed for each Spark executor and for the master.
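As a quick sanity check, the slot accounting above can be sketched in shell. The assumption that the workspace driving the cluster also consumes one slot of its own follows from the list of execution types above:

```shell
# Slots consumed by a workspace with an on-demand Spark cluster:
# one per executor, one for the Spark master, and one for the
# workspace execution itself.
N_EXECUTORS=4
SLOTS=$((N_EXECUTORS + 1 + 1))
echo "$SLOTS"   # 6 slots of the per-user quota (default 25) in use
```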
See Important settings for details.
From the top menu bar in the admin UI, click Infrastructure. You will see both Platform and Compute nodes in this
interface. Click the name of a node to get a complete description, including all applied labels, available resources, and
currently hosted pods. This is the full kubectl describe for the node. Non-Platform nodes in this interface with
a value in the Node Pool column are compute nodes that can be used for Domino Runs by configuring a Hardware
Tier to use the pool.
From the top menu of the admin UI, click Executions. This interface lists active Domino execution pods and shows
the type of workload, the Hardware Tier used, the originating user and project, and the status for each pod. There
are also links to view a full kubectl describe output for the pod and the node, and an option to download the
deployment lifecycle log for the pod generated by Kubernetes and the Domino application.
Each Spark node, including master and worker nodes, launched as part of an on-demand Spark cluster will be displayed
as a separate row in the executions interface, with complete information available on the originating project and user,
as well as the hardware tier.
From the top menu of the admin UI, click Advanced > Hardware Tiers, then on the Hardware Tiers page click New
to create a new Hardware Tier or Edit to modify an existing Hardware Tier.
Keep in mind that your Hardware Tier’s CPU, memory, and GPU requests should not exceed the available resources
of the machines in the target node pool after accounting for overhead. If you need more resources than are available on
existing nodes, you may need to add a new node pool with different specifications. This may mean adding individual
nodes to a static cluster, or configuring new auto-scaling components that provision new nodes with the required
specifications and labels.
You can allow hardware tiers to exceed the default limit of 64MB for shared memory. This is especially beneficial for
applications that can make use of shared memory.
From the top menu of the admin UI, click Advanced > Hardware Tiers, then on the Hardware Tiers page click New to create a new Hardware Tier or Edit to modify an existing Hardware Tier. Check the Allow executions to exceed the default shared memory limit checkbox.
Checking this option will override the /dev/shm (shared memory) limit, and any shared memory consumption will
count toward the overall memory limit of the hardware tier. Be sure to consider and incorporate the size of /dev/shm
in any memory usage calculations for a hardware tier with this option enabled.
Warning: /dev/shm is considered part of the overall memory footprint of an execution container. It is possible to exceed the total memory of the container when overriding /dev/shm to use more shared memory. Exceeding the container’s memory limit via /dev/shm will terminate the container.
The following settings in the common namespace of the Domino central configuration affect compute grid behavior.
• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.deployingStateTimeoutSeconds
• Value: Number of seconds an execution pod in a deploying state will wait before timing out. Default is 60 * 60 (1 hour).
• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.preparingStateTimeoutSeconds
• Value: Number of seconds an execution pod in a preparing state will wait before timing out. Default is 60 * 60 (1 hour).
• Key: com.cerebro.domino.computegrid.userExecutionsQuota.maximumExecutionsPerUser
• Value: Maximum number of executions each user may have running concurrently. If a user tries to run more than this, the excess executions will queue until existing executions finish. Default is 25.
• Key: com.cerebro.computegrid.timeouts.sagaStateTimeouts.userExecutionsOverQuotaStateTimeoutSeconds
• Value: Number of seconds an execution pod that cannot be assigned due to user quota limitations will wait for resources to become available before timing out. Default is 24 * 60 * 60 (24 hours).
• Overview
• Accounting for overhead
– Kubernetes management overhead
– Domino daemon-set overhead
– Domino execution overhead
– When should I account for overhead?
– Example
• Isolating workloads and users using node pools
• Set resource requests and limits to the same values
6.2.1 Overview
Domino Hardware Tiers define Kubernetes requests and limits and link them to specific node pools. We recommend
the following best practices.
1. Accounting for overhead
2. Isolating workloads and users using node pools
3. Setting resource requests and limits to the same values
When designing hardware tiers, you need to take into account what resources will be available on a given node when
Domino submits your workload for execution. Not all physical memory and CPU cores of your node will be available
due to system overhead.
You should consider the following overhead components:
1. Kubernetes management overhead
2. Domino daemon-set overhead
3. Domino execution sidecar overhead
Kubernetes typically reserves a portion of each node’s capacity for daemons and pods that are required for Kubernetes itself. The amount of reserved resources usually scales with the size of the node, and also depends on the Kubernetes provider or distribution.
Click the links below to view information on reserved resources for cloud-provider managed Kubernetes offerings:
• AWS EKS
• Azure AKS
• Google GKE
The best way to understand the available resources for your instance is to check one of your compute nodes with the
kubectl describe nodes command and then look for the Allocatable section of the output. It will show
the memory and CPU available for Domino.
Domino runs a set of management pods that reside on each of the compute nodes. These are used for things like log
aggregation, monitoring, and environment image caching.
The overhead of these daemon-sets is roughly 0.5 CPU cores and 0.5 Gi RAM. This overhead is taken from the
allocatable resources on the node.
Lastly, for each Domino execution, there is a set of supporting containers in the execution pod that manage authentication, handle request routing, load files, and install dependencies. These supporting containers make CPU and memory requests that Kubernetes takes into account when scheduling execution pods.
The supporting container overhead is currently roughly 1 CPU core and 1.5 GiB RAM. This is configurable and may vary for your specific deployment.
Overhead is relevant if you want to define a hardware tier dedicated to one execution at a time per node, such as for a
node with a single physical GPU. It is also relevant if you absolutely need to maximize node density.
Example
Consider an m5.2xlarge EC2 node with a raw capacity of 8 CPU cores and 32 GiB of RAM.
When used as part of an EKS cluster, the node reports an allocatable capacity of ~27 GiB of RAM and 7910m of CPU.
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 104845292Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32120476Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 7910m
ephemeral-storage: 95551679124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 28372636Ki
pods: 58
On top of that, conservatively account for 500m CPU and 0.5 GiB of RAM for the Domino and EKS daemons. Lastly, for a single execution, add 1000m CPU and 1.5 GiB RAM for sidecars, and you are left with roughly 6410m CPU and 25 GiB RAM that you can use for a single large hardware tier.
If you want to partition the node into smaller hardware tiers, you will need to account for the sidecar overhead for every execution that you want to colocate.
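The arithmetic above can be sketched as follows, using the Allocatable figures reported by kubectl describe nodes for the m5.2xlarge. The overhead figures are the rough estimates quoted in this guide and may differ in your deployment:

```shell
# Allocatable capacity reported by the node (see the output above).
ALLOC_CPU_M=7910
ALLOC_MEM_KI=28372636

# Rough overhead estimates from this guide (may vary per deployment).
DAEMON_CPU_M=500;   DAEMON_MEM_KI=524288     # 0.5 GiB for daemon-sets
SIDECAR_CPU_M=1000; SIDECAR_MEM_KI=1572864   # 1.5 GiB for execution sidecars

USABLE_CPU_M=$((ALLOC_CPU_M - DAEMON_CPU_M - SIDECAR_CPU_M))
USABLE_MEM_GI=$(( (ALLOC_MEM_KI - DAEMON_MEM_KI - SIDECAR_MEM_KI) / 1048576 ))

echo "${USABLE_CPU_M}m CPU, ~${USABLE_MEM_GI}GiB RAM"   # 6410m CPU, ~25GiB RAM
```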
As a general rule, larger nodes allow for more flexibility as Kubernetes will take care of efficiently packing your
executions onto the available capacity.
You can see which pods are running on a specific node by visiting the Infrastructure admin page and clicking on
the name of the node. In the image below, there is a box around the execution pods. The other pods handle logging,
caching, and other services.
Node pools are defined by labels added to nodes in a specific format: dominodatalab.com/node-pool=<your-node-pool>. In the hardware tier form, you only need to include your-node-pool. You can name a node pool anything you like, but we recommend choosing names that are meaningful given the intended use.
Domino typically comes pre-configured with default and default-gpu node pools, with the assumption that
most user executions will run on nodes in one of those pools. As your compute needs become more sophisticated, you
may want to keep certain users separate from one another or provide specialized hardware to certain groups of users.
So if there’s a data science team in New York City that needs a specific GPU machine that other teams don’t need, you
could use the following label for the appropriate nodes: dominodatalab.com/node-pool=nyc-ds-gpu. In
the hardware tier form, you would specify nyc-ds-gpu. To ensure only that team has access to those machines,
create a NYC organization, add the correct users to the organization, and give that organization access to the new
hardware tier that uses the nyc-ds-gpu node pool label.
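Applying such a label to an existing node can be sketched as below; the node name is hypothetical, so list your own nodes first:

```shell
# The node name is hypothetical; find yours with `kubectl get nodes`.
POOL_LABEL="dominodatalab.com/node-pool=nyc-ds-gpu"
kubectl label node ip-10-0-1-23.ec2.internal "$POOL_LABEL"
```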
With Kubernetes, resource limits must be >= resource requests. So if your memory request is 16 GiB, your limit must be >= 16 GiB. While setting a limit greater than the request can be useful - there are cases where allowing bursts of CPU or memory helps - it is also dangerous. Kubernetes may evict a pod that is using more resources than it initially requested. For Domino workspaces or jobs, this would cause the execution to be terminated.
It is for this reason that we recommend setting memory and CPU requests equal to limits. In this case, Python and R cannot allocate more memory than the limit, and execution pods will not be evicted.
On the other hand, if the limit is higher than the request, it is possible for a user to consume resources that another user’s execution pod should be able to access. This is the “noisy neighbor” problem that you may have experienced in other multi-user environments. But instead of allowing the noisy neighbor to degrade performance for other pods on the node, Kubernetes will evict the offending pod when necessary to free up resources.
User data on disk will not be lost, because Domino stores user data on a persistent volume that can be reused. But
anything in memory will be lost and the execution will have to be restarted.
6.3.1 Overview
The pods that host Model APIs have hardware specifications based on resource quotas set by Domino system administrators. A resource quota determines the CPU and memory resources available to the Model that uses it.
Users will be able to access resource quotas from a dropdown menu on the Model deployment page.
From the admin home, click Advanced -> Resource Quotas to open the management interface.
From here you can create, edit, and set default resource quotas. Resource quotas cannot be permanently deleted. To
make a resource quota unavailable for use, edit it and set Visible to false.
Resource quotas have the following properties:
• CPUs requested The number of cores that will be reserved for a Model with this quota.
• Memory requested The amount of RAM that will be reserved for a model with this quota.
• CPU limit If the hosting node has idle cores available, a model running this quota can make use of additional
cores up to this limit.
• Memory limit If the hosting node has RAM available, a model running this quota can make use of additional
memory up to this limit.
• Visible This property on a resource quota must be set to true for the quota to appear in the dropdown selector
for users publishing Models.
• Default The resource quota with this set to true is the quota that will be used for all newly published Models
by default.
• Overview
• Definitions
• Storage workflow for Jobs
• Storage workflow for Workspaces
• Resumable Workspace volume backups on AWS
• Garbage collection
• Salvaged volumes
• FAQ
6.4.1 Overview
When not in use, Domino project files are stored and versioned in the Domino blob store. When a Domino run is started from a project, the project’s files are copied to a Kubernetes persistent volume that is attached to the compute node and mounted in the run.
6.4.2 Definitions
When a user starts a new job, Domino will broker assignment of a new execution pod to the cluster. This pod will
have an associated PVC which defines for Kubernetes what type of storage it requires. If an idle PV exists matching
the PVC, Kubernetes will mount that PV on the node it assigns to host the pod, and the job or workspace will start. If
an appropriate idle PV does not exist, Kubernetes will create a new PV according to the Storage Class.
When the user completes their workspace or job, the PV data is written to the Domino File System, and the PV is unmounted and sits idle until it is either reused for the user’s next job or garbage collected. By reusing PVs, users who are actively working in a project do not need to repeatedly copy data from the blob store to a PV.
A job will only match with either a fresh PV or one previously used by that project. PVs are not reused between projects.
Workspace volumes are handled differently than volumes for Jobs. Workspaces are potentially long lived development
environments that users will stop and resume repeatedly without writing data back to the Domino File System each
time. As a result, the PV for the workspace is a similarly long-lived resource that stores the user’s working data.
These workspace PVs are durably associated with the resumable workspace they are initially created for. Each time
that workspace is stopped, the PV is detached and preserved so that it’s available the next time the user starts the
workspace. When the workspace starts again, it reattaches its PV and the user will see all of their working data saved
during the last session.
Only when a user chooses to initiate a sync will the contents of their project files in the workspace PV be written
back to the Domino File System. A resumable workspace PV will only be deleted if the user deletes the associated
workspace.
Since the data in resumable workspace volumes is not automatically written back to the Domino File System, there is a risk of lost work should the volume be lost or deleted. When Domino is running on AWS, it safeguards against this by snapshotting the EBS volume that backs the workspace PV to S3. If you have accidentally deleted or lost a resumable workspace volume that contains data you want to recover, contact Domino support for assistance in restoring from the snapshot.
Domino has configurable values to help you tune your cluster to balance performance with cost controls. The more idle volumes you allow, the more likely it is that users will be able to reuse a volume and avoid copying project files from the blob store. However, this comes at the cost of keeping additional idle PVs.
By default, Domino will:
• Limit the total number of idle PVs to 32. This can be adjusted by setting the following option in the central config (namespace common):
com.cerebro.domino.computegrid.kubernetes.volume.maxIdle
• Terminate any idle PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config (namespace common):
com.cerebro.domino.computegrid.kubernetes.volume.maxAge
This value is expressed in days. The default value is empty, which means unlimited. A value of 7d will terminate any idle PV after seven days.
When a user’s job fails unexpectedly, Domino preserves the volume so data can be recovered. After a workspace or job ends, claimed PVs are placed into one of the following states, indicated with the dominodatalab.com/volume-state label.
• available
If the run ends normally, the underlying PV will be available for future runs.
• salvaged
If the run fails, the underlying PV will not be eligible for reuse, and is held in this state to be salvaged.
Salvaged PVs will not be reused automatically by future workspaces or jobs, but can be manually mounted to a workspace in order to recover work.
By default, Domino will:
• Limit the total number of salvaged PVs to 64. This can be adjusted by setting the following option (namespace common) in the central config:
com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged
• Terminate any salvaged PV that has not been used in a certain number of days. This can be adjusted by setting the following option (namespace common) in the central config:
com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge
This value is expressed in days. The default is seven days. A value of 14d will terminate any salvaged PV after fourteen days.
To recover a salvaged volume:
1. Find the PV that was attached to your job or workspace, which will be in the Deployment logs for your job or
workspace.
2. Create a pod attached to the salvaged volume.
3. Recover the files with your most convenient method (scp, AWS CLI, kubectl cp, etc.)
This script will perform Step 2 and will provide the appropriate commands in its output. Remember to delete the PVC and PV when you are done; otherwise these resources will continue to be used.
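If you prefer to construct the recovery pod manually, a minimal sketch follows. The pod name, namespace, image, and mount path are illustrative choices, and <salvaged-pvc-name> is a placeholder for the PVC you identified in step 1:

```yaml
# Hypothetical recovery pod that mounts a salvaged PVC so files can be copied out
apiVersion: v1
kind: Pod
metadata:
  name: salvage-recovery
  namespace: domino-compute
spec:
  containers:
    - name: shell
      image: ubuntu:20.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: salvaged
          mountPath: /recovered
  volumes:
    - name: salvaged
      persistentVolumeClaim:
        claimName: <salvaged-pvc-name>
```

With the pod running, a command such as kubectl cp domino-compute/salvage-recovery:/recovered ./recovered copies the files locally. Delete the pod, PVC, and PV when recovery is complete.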
6.4.8 FAQ
How do I change the size of the storage volume for my jobs or workspaces?
You can set the volume size for new PVs by editing the following central config value:
• Overview
• Creating a scalable node pool in EKS
6.5.1 Overview
Making a new node group available to Domino is as simple as adding new Kubernetes worker nodes with a distinct
dominodatalab.com/node-pool label. You can then reference the value of that label when creating new
hardware tiers to configure Domino to assign executions to those nodes.
See below for an example of creating a scalable node pool in EKS.
This example shows how to create a new node group with eksctl and expose it to the cluster autoscaler as a labeled
Domino node pool.
1. Create a new-nodegroup.yaml file like the one below, and configure it with the properties you want the
new group to have. All values shown with a $ are variables that you should modify.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $CLUSTER_REGION
nodeGroups:
  - name: $GROUP_NAME # this can be any name you choose, it will be part of the ASG and template name
    instanceType: $AWS_INSTANCE_TYPE
    minSize: $MINIMUM_GROUP_SIZE
    maxSize: $DESIRED_MAXIMUM_GROUP_SIZE
    volumeSize: 400 # important to allow for image caching on Domino workers
    availabilityZones: ["$YOUR_CHOICE"] # this should be the same AZ (or the same multiple AZs) as your other node pools
    ami: $AMI_ID
    labels:
      "dominodatalab.com/node-pool": "$NODE_POOL_NAME" # this is the name you'll reference from Domino
    tags:
      "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool": "$NODE_POOL_NAME"
      # "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu": "true" # uncomment this line if this pool uses a GPU instance type
2. Once your configuration file describes the group you want to create, run eksctl create nodegroup --config-file=new-nodegroup.yaml.
3. Take the name of the resulting ASG and add it to the autoscaling.groups section of your domino.yml installer configuration.
4. Run the Domino installer to update the autoscaler.
5. Create a new hardware tier in Domino that references the new labels.
When finished, you can start Domino executions that use the new Hardware Tier and those executions will be assigned
to nodes in the new group, which will be scaled as configured by the cluster autoscaler.
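For step 3, the autoscaling.groups entry might look like the following sketch. The ASG name and the exact field names shown here are assumptions; adapt them to your installer's configuration reference:

```yaml
# Hypothetical domino.yml fragment registering the new ASG with the autoscaler
autoscaling:
  groups:
    - name: eksctl-my-cluster-nodegroup-my-group-NodeGroup  # ASG name from the eksctl output
      min-size: 0
      max-size: 5
```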
• Overview
• Temporarily removing a node from service
• Permanently removing a node from service
– Identifying user workloads
– Dealing with long-running workloads
– Dealing with older versions of Kubernetes
• Sample commands for iterating over many nodes and/or pods
6.6.1 Overview
There may be times when you need to remove a specific node (or multiple nodes) from service, either temporarily or
permanently. This may include cases of troubleshooting nodes that are in a bad state, or retiring nodes after an update
to the AMI so that all nodes are using the new AMI.
This page describes how to temporarily prevent new workloads from being assigned to a node, as well as how to safely
remove workloads from a node so that it can be permanently retired.
The kubectl cordon <node> command will prevent any additional pods from being scheduled onto the node,
without disrupting any of the pods currently running on it. For example, let’s say a new node in your cluster has come
up with some problems, and you want to cordon it before launching any new runs to ensure they will not land on that
node. The procedure might look like this:
$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
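With the node name from the listing, the cordon step itself would then be, for example:

```shell
# Prevent new pods from being scheduled onto the problematic node
kubectl cordon ip-192-168-0-221.us-east-2.compute.internal
```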
Before removing a node from service permanently, you should ensure there are no workloads still running on it
that should not be disrupted. For example, you might see the following workloads running on a node (notice the
specification of the compute namespace with -n and wide output to include the node hosting the pod with -o):
$ kubectl get po -n domino-compute -o wide | grep ip-192-168-24-46.us-east-2.compute.internal
Different types of workloads should be treated differently. You can see the details of a particular workload with kubectl describe po run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute. The labels section of the describe output is particularly useful to distinguish the type of workload, as each of the workloads named run-... will have a label like dominodatalab.com/workload-type=<type of workload>.
The example above contains one each of the major user workloads:
• run-5e66acf26437fe0008ca1a88-f95mk is a Job, with label dominodatalab.com/workload-type=Batch. It will stop running on its own once it is finished and disappear from the list of active workloads.
• run-5e66ad066437fe0008ca1a8f-629p9 is a Workspace, with label dominodatalab.com/workload-type=Workspace. It will keep running until the user who launched it shuts it down. You have the option of contacting users to shut down their workspaces, waiting a day or two in the expectation they will shut them down naturally, or removing the node with the workspaces still running. (The last option is not recommended unless you are certain there is no un-synced work in any of the workspaces and have communicated with the users about the interruption.)
• run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7 is an App, with label dominodatalab.com/workload-type=App. It is a long-running process, and is governed by a Kubernetes deployment. It will be recreated automatically if you destroy the node hosting it, but will experience whatever downtime is required for a new pod to be created and scheduled on another node. See below for methods to proactively move the pod and reduce downtime.
• model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9 is a Model API. It does not have a dominodatalab.com/workload-type label, and instead is easily identifiable by the pod name. It is also a long-running process, similar to an App, with similar concerns. See below for methods to proactively move the pod and reduce downtime.
• domino-build-5e67c9299c330f0008f70ad1 is an environment image build. It will finish on its own and go into a Completed state.
For the long-running workloads governed by a Kubernetes deployment, you can proactively move the pods off of the
cordoned node by running a command like this:
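On Kubernetes versions that support it, this is a rollout restart of the deployment; the deployment name below is taken from the App pod in the earlier listing:

```shell
# Recreate the deployment's pods, which will be scheduled onto schedulable (non-cordoned) nodes
kubectl rollout restart deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute
```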
Notice the name of the deployment is the same as the first part of the name of the pod in the above section. You can see
a list of all deployments in the compute namespace by running kubectl get deploy -n domino-compute.
Whether the associated App or Model API experiences any downtime will depend on the update strategy of the deployment. For the two example workloads above in a test deployment, one App and one Model API, the update strategies differ as follows.
The App in this case would experience some downtime, since the old pod will be terminated immediately (1 max unavailable with only 1 pod currently running). The model will not experience any downtime, since the termination of the old pod will be forced to wait until a new pod is available (0 max unavailable). If desired, you can edit the deployments to change these settings and avoid downtime.
Earlier versions of Kubernetes do not have the kubectl rollout restart command, but a similar effect can be achieved by “patching” the deployment with a throwaway annotation like this:
$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'
The patching process will respect the same update strategies as the above restart command.
6.6.4 Sample commands for iterating over many nodes and/or pods
In cases where you need to retire many nodes, it can be useful to loop over many nodes and/or workload pods in a
single command. Customizing the output format of kubectl commands, appropriate filtering, and combining with
xargs makes this possible.
For example, to cordon all nodes in the default node pool, you can run the following:
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers | xargs kubectl cordon
To view only apps running on a particular node, you can filter using the labels discussed above:
$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>
To do a rolling restart of all model pods (over all nodes), you can run:
$ kubectl get deploy -n domino-compute -o custom-columns=:.metadata.name --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy
When constructing such commands for larger maintenance, always run the first part of the command by itself to verify that the list of names being passed to xargs and on to the final kubectl command is what you expect.
Domino uses Keycloak, an enterprise-grade open source authentication service, to manage users and logins. Keycloak runs in a pod in the Domino Platform. There are three modes you can use for identity management in Domino:
1. Local usernames and passwords
2. Identity federation to LDAP / AD
3. Identity brokering to a SAML provider for SSO
apiVersion: v1
data:
password: <encrypted-password>
kind: Secret
metadata:
creationTimestamp: 2019-09-09T21:23:15Z
labels:
app.kubernetes.io/instance: keycloak
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: keycloak
helm.sh/chart: keycloak-4.14.1-0.10.2
name: keycloak-http
namespace: domino
resourceVersion: "6746"
selfLink: /api/v1/namespaces/domino/secrets/keycloak-http
uid: 09009f96-d348-11e9-9ea1-0aa417381fd6
type: Opaque
Decode the password by running echo '<encrypted-password>' | base64 --decode. With this password you will be able to log in to the Keycloak UI as the Keycloak administrator user in the master realm. Read the official Keycloak documentation on the master realm to learn more.
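For example, decoding an illustrative base64 value (this placeholder is not a real Keycloak secret):

```shell
# Decode a base64-encoded secret value as stored in the Kubernetes Secret
echo 'cGFzc3dvcmQxMjM=' | base64 --decode   # prints: password123
```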
Keycloak will be configured automatically by Domino with a realm named DominoRealm that is used for Domino authentication. When reviewing or changing settings for Domino authentication, ensure that you have DominoRealm selected in the upper left.
The simplest option for authentication to Domino is to use local usernames and passwords. In this case all user
information is stored by Keycloak in the Postgres database, and there is no federation or brokering to other identity
providers.
7.2.1 Configuration
In this mode the key settings are on the Login tab of the DominoRealm settings page.
The one setting on this tab that is not supported is Email as username, as that would automatically use the email address as the username, and Domino currently does not support that as a valid username. Note also that if you want to use the Verify Email option, an SMTP connection must be configured in the Email tab.
You can add, edit, and deactivate local users from the Users menu. Click View all users to load user data.
Keycloak provides the ability to connect to an LDAP / AD identity provider and cache user information.
This can be configured in the User Federation menu. Select ldap from the Add provider... dropdown menu.
For details on all available options, read the official Keycloak documentation on User storage federation.
When adding a provider according to those docs, if you are migrating from an older Domino version, you can use your existing ldap.conf file on the Domino frontend to see exactly which inputs you should use for the provider settings. Some of the key pieces of information are:
Group and role synchronization can be configured with steps similar to those listed for SSO, except that user attributes must first be imported to Keycloak via an LDAP mapper. Once that is done, and the users in Keycloak have the appropriate user attributes specifying group membership or role, the remaining setup (to map from Keycloak to Domino) follows the steps in the SSO group and role synchronization related to Client Mappers.
NOTE: Updates to a user’s group or role will not fully synchronize to Domino until that user next logs in to Domino.
In addition to configuring the LDAP connection, you may also need to review the LDAP mappers associated with the LDAP connection you have configured. Some mappers will be configured by default based on the LDAP vendor that was chosen, but you may need to modify these based on the specific configuration of your provider. You will need to make sure that there are mappers for the following attributes:
• username
• firstName
• lastName
• email
For more details, read the official Keycloak documentation on LDAP mappers.
Domino can integrate with a SAML 2.0 or OIDC identity provider for Single Sign-On (SSO) with the steps outlined
below.
Provide an Alias for the newly created provider. This is a unique name for the provider in Keycloak, and it will
also be part of the Redirect URI used by the provider service to route SAML responses and redirect users following
authentication.
The Redirect URI (case sensitive) will be:
https://<deployment_domain>/auth/realms/DominoRealm/broker/<alias>/endpoint
In an example deployment with domain domino.acme.org and provider alias domino-credentials, the URL will be:
https://domino.acme.org/auth/realms/DominoRealm/broker/domino-credentials/endpoint
Do not save the identity provider entry yet, as you will not be able to import your provider settings once it is saved.
To complete the configuration, you need to create a SAML application in the identity provider that will be integrated
with Domino. To create the application you will need the Redirect URI from the step above.
The specific procedure for creating the SAML endpoint will depend on your identity provider. Domino can integrate
via SAML with Okta, Azure AD, Ping, and any other provider that implements SAML v2.0.
The following are important properties of the SAML endpoint you will create in the provider. After the SAML
endpoint has been created and configured, you should export an XML metadata file you can use to complete the
configuration of the provider in Keycloak.
The NameID policy controls the format of the <saml2:NameID> element in the SAML Response. This will be used to derive the SSO username in Domino.
• Option 1: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
– Users will be uniquely identified by their email and username will be automatically derived from
it
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-
˓→format:emailAddress">
john.smith@acme.org
</saml2:NameID>
• Option 2: urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified
– The SAML endpoint will need to respond with a string that can be used as the username of the
user without any modification
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-
˓→format:unspecified">
jsmith
</saml2:NameID>
• Option 3: urn:oasis:names:tc:SAML:1.1:nameid-format:persistent
– Typically the SAML endpoint will return a NameID that is a GUID, which is not suitable as a username
– If the endpoint must use this format, then an additional attribute containing the username must be returned
Assertion attributes
Additional SAML attributes are required to automatically populate the Domino profile. Without these, on first login
the user will be prompted to complete the required elements of their user profile.
The required attributes are:
• First Name
• Last Name
• Email
• Username (if NameId is not email or does not represent user name)
No specific attribute names are expected as these can be mapped in Keycloak.
Additional requirements
You can use the metadata file from the step above to complete configuration of the provider in Keycloak.
You can do this from the bottom of the identity provider configuration page. This is only available before the provider
is saved for the first time.
If you are importing from a file, make sure to click Import after selecting the file. After import, most of the provider
settings will be configured automatically. You can now save the configuration.
Additional settings
This should have been configured on import, but verify that it matches the option configured on the external
endpoint.
• Want Assertions Signed - Yes
• Validate Signature - Yes
The corresponding signature field should already be populated based on the metadata you imported in the previous step.
Additional options like Assertion Encryption and Request Signing are supported, but would require additional configuration coordination between Keycloak and the endpoint in your identity provider.
For more detailed documentation of all supported SAML settings, see Keycloak SAML v2 Identity Providers
Once the provider in Keycloak is saved, an Export tab will appear that contains XML metadata for the provider that
can be used to automatically configure the external endpoint.
The metadata will also be available at:
https://<deployment domain>/auth/realms/DominoRealm/broker/<alias>/endpoint/descriptor
In order to make the experience of new users signing in for the first time seamless, and not require them to complete
their profile on initial login, you need to make sure that several SAML attributes are being passed back in SAML
responses and that these are correctly mapped to Domino user attributes.
If the attributes are not properly mapped, upon first login users will be prompted to complete missing fields in their
profile.
To map these values from the SAML assertion attributes to the user profile model, you need to configure an Attribute
Importer mapper from the Mappers tab.
Mapping username
The mapper configuration for username depends on how the external endpoint is configured with respect to NameID
Policy options.
• Option 1: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
– Use Email Prefix as UserName Importer
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
john.smith@acme.org
</saml2:NameID>
Map as shown:
• Option 2: urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified
– No importer is needed. The username will be mapped automatically to the NameID value
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-
˓→format:unspecified">
jsmith
</saml2:NameID>
• Option 3: urn:oasis:names:tc:SAML:1.1:nameid-format:persistent
– Use Username Template Importer with Template of ${ATTRIBUTE.<attribute Name>}
or ${ATTRIBUTE.<attribute FriendlyName>}
– Example:
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">
jsmith
</saml2:NameID>
<saml2:Attribute Name="customUserName">
<saml2:AttributeValue>jsmith</saml2:AttributeValue>
</saml2:Attribute>
Map as shown:
For additional information on attribute mapping, refer to the Keycloak documentation for Mapping Claims and Assertions.
When troubleshooting SAML attribute mapping, ideally you will want to have a specification for the SAML response that your identity provider endpoint will send back to Keycloak following authentication. A thorough specification will detail the NameID policy format and the attributes being sent in the response.
If such a specification is not available, or the attribute mapping does not function as expected, it may be necessary
to examine an actual SAML response that is returned after a login attempt. One simple way to do this is to use the
SAML-tracer extension available for Chrome and Firefox. It will allow you to examine decoded SAML requests and
responses. By examining a SAML response, you will be able to see the attributes that are being returned and verify
whether attributes are missing or the names or formats are different from what is expected.
To configure the recommended login authentication flow, select the Domino First Broker Login flow (no
dashes):
Typically, when configuring the SAML endpoint that will provide SSO authentication for Domino, the provider administrator restricts the endpoint to a subset of users who should be allowed to authenticate through it. This is the preferred method for restricting access to a subset of users with valid enterprise credentials.
In rare cases, where limitations in the provider software don’t allow you to constrain the set of users who can authenticate against the endpoint, the provider will need to pass an additional SAML attribute which specifies whether a user is allowed to access Domino. The value of that attribute will depend on a specific rule for each user. Usually, it will be based on membership in a particular group in your identity provider.
The following should be used as a last resort if all identity provider restriction options are exhausted.
Prerequisites
There must be an attribute that indicates whether a properly authenticated user should be allowed to log in to Domino.
• AttributeName:
– Suggested: rolesForDomino
– Could be anything as this can be mapped
• Multi-valued: Yes
• Value:
– Contains one or more values that could be used for gating access. Typically would be roles or groups.
Attribute mapper
<saml2:Attribute Name="rolesForDomino">
<saml2:AttributeValue>dave-users</saml2:AttributeValue>
<saml2:AttributeValue>it-users</saml2:AttributeValue>
</saml2:Attribute>
Before modifying the default Domino First Broker Login flow, you should first make a copy of it.
Move the new Script entry to be immediately after the Create User If Unique execution.
Use the following script, modifying the attribute value as needed.
/*
 * Template for JavaScript based authenticators.
 * See org.keycloak.authentication.authenticators.browser.ScriptBasedAuthenticatorFactory
 */
/**
 * An example authenticate function.
 */
LOG.info(script.name + " trace script auth for: " + user.username);
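As a sketch of what the gating logic in such a script might look like: the attribute name rolesForDomino and the required value domino-users are assumptions, and in Keycloak the user and context objects are injected by the script engine (where the attribute value is a Java list, so you would use values.contains(...) rather than indexOf). Here those bindings are modeled as plain JavaScript objects so the logic can be exercised standalone:

```javascript
// Hedged sketch of a Keycloak script authenticator's access-gating logic.
// "rolesForDomino" and "domino-users" are assumed names, not Domino defaults.
function isAllowed(user, requiredValue) {
  // Multi-valued attributes come back as a list of string values
  var values = user.getAttribute("rolesForDomino") || [];
  return values.indexOf(requiredValue) !== -1;
}

function authenticate(context, user) {
  if (isAllowed(user, "domino-users")) {
    context.success(); // user may proceed with login
  } else {
    context.failure("ACCESS_DENIED"); // deny login to Domino
  }
}

// Mock objects standing in for Keycloak's injected bindings
var result = null;
var mockContext = {
  success: function () { result = "success"; },
  failure: function (err) { result = "failure:" + err; }
};
var mockUser = {
  getAttribute: function (name) {
    return name === "rolesForDomino" ? ["dave-users", "domino-users"] : [];
  }
};
authenticate(mockContext, mockUser);
console.log(result); // → success
```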
When using the default domino-theme in Keycloak, each identity provider has a display text field that can be edited.
This display text will show up on the SSO button for that identity provider. If Display text is blank or equal to the
alias value, the button will display the default text Continue with Single Sign On. If any text other than the
value of the Alias field is used, that value becomes the text on the button.
If you encounter errors from the Keycloak service while attempting an SSO login, you can view the Keycloak request
logs via kubectl by running kubectl -n <domino-platform-namespace> logs keycloak-0.
By default, sessions are limited to 60 days but can be configured differently as needed.
See the Keycloak documentation for more information on timeouts.
Overview
If you have enabled SSO for Domino, you can optionally configure AWS credential propagation, which allows for
Domino to automatically assume temporary credentials for AWS roles that are based on roles assigned to users in the
upstream identity provider. Below is a reference for the overall workflow from user login to credential usage.
1. The Identity Provider Relying Party/Application validates the Issuer element in the AuthnRequest (SAML request) sent by Domino
2. Domino validates the Audience (Entity ID of the SP) in the SAML Response sent by the Identity Provider Relying Party/Application
3. AWS AssumeRole validates that the Issuer of the SAML Response passed on from Domino matches the Issuer of the Identity Provider Relying Party/Application. You can also set up additional validations (e.g. validating the Audience)
The following central configuration settings need to be set as shown to enable credential propagation. These can be
found or added by a Domino administrator by clicking Advanced > Central Config from the administration UI.
• Key: com.cerebro.domino.auth.aws.sts.enabled
Value: true
• Key: com.cerebro.domino.auth.aws.sts.region
Value: Short AWS region name where your Domino is deployed, such as us-west-2
• Key: com.cerebro.domino.auth.aws.sts.defaultSessionDuration
Remember to restart the services with the link at the top of the central configuration page for these settings to take
effect.
You need to have federation between your AWS account and your identity provider configured independently of Domino. For an example, see AWS Federated Authentication with Active Directory Federation Services (AD FS).
The SAML provider application connected to Domino needs to include the appropriate AWS federation attributes based on the roles that each user will be allowed to assume.
Since Domino will refresh the user’s credentials during an active session, you must ensure that any IAM role that you propagate to a user has an assume-self policy.
For example:
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "<ARN for the role>"
}
}
* Duration (in seconds) of how long the initial set of credentials for each of the roles is valid before the user will need to log in again
* The duration must be smaller than the maximum allowable duration for each of the roles made available for a given user
Before proceeding, it’s useful to check that your SAML attributes appear in the SAML response when logging into Domino. This will help validate that you’ve correctly established trust between AWS and your IdP. One simple way to do this is to use the SAML-tracer extension available for Chrome and Firefox. It will allow you to examine decoded SAML requests and responses to see that the appropriate attributes appear.
Example:
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
<saml2:AttributeValue xsi:type="xs:string">
arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role1
</saml2:AttributeValue>
<saml2:AttributeValue xsi:type="xs:string">
arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role2
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/RoleSessionName">
<saml2:AttributeValue xsi:type="xs:string">
john.smith@acme.org
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/SessionDuration">
<saml2:AttributeValue xsi:type="xs:string">
900
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
To map the appropriate values from the SAML assertion, you need to configure an Attribute Importer mapper from
the Mappers tab for the following attributes.
• AWS Roles
– Name: AWS Roles
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/Role
– Friendly Name: <blank>
– User Attribute Name: Must be aws-roles
• AWS Role Session Name
– Name: AWS Role Session Name
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/RoleSessionName
– Friendly Name: <blank>
– User Attribute Name: Must be aws-role-session-name
• AWS Session Duration
– Name: AWS Session Duration
– Mapper Type: Attribute Importer
– Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionDuration
– Friendly Name: <blank>
– User Attribute Name: Must be aws-session-duration
In order to give Domino access to users’ SAML assertions, you need to enable the following settings on the identity provider:
• Store Tokens: On
• Store Tokens Readable: On
Domino-client configuration
The domino-play OIDC client is pre-populated on installation with client mappers, so that IdP-mapped SAML information will flow into Domino.
1. Go to the Clients tab in the DominoRealm and select the domino-play client
2. Create a new mapper with type User Session Note and the following settings:
• Name: identity-provider-mapper
• Mapper Type: User Session Note
• User Session Note: identity_provider
• Token Claim Name: idpbroker
• Claim JSON Type: string
• Add to ID token: On
• Add to access token: On
Usage
Once configured properly for the first time, you will need to log out and log back in to Domino.
To confirm that credentials are propagating correctly to users, start a workspace, check the environment variable AWS_SHARED_CREDENTIALS_FILE, and verify that your credential file appears at /var/lib/domino/home/.aws/credentials.
This should be sufficient for a user to connect to AWS resources, such as S3, without further configuration.
Learn more about using a credential file with AWS SDK.
To test your configuration outside of Domino, verify that you can successfully perform an AssumeRoleWithSAML call using the SAML token provided to Domino by your IdP.
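For instance, with the AWS CLI such a call looks roughly like this; the ARNs and the assertion file are placeholders matching the earlier examples, not values from your deployment:

```shell
# Exchange a base64-encoded SAML assertion for temporary AWS credentials
aws sts assume-role-with-saml \
  --role-arn arn:aws:iam::123456789012:role/role1 \
  --principal-arn arn:aws:iam::123456789012:saml-provider/acme-saml \
  --saml-assertion file://assertion.b64
```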
Domino supports synchronizing Domino administrative user roles and organization membership with attributes in your SAML identity provider. This allows management of these roles and memberships to be externalized to the identity provider.
Prerequisite
Your SAML provider application connected to Domino must include group membership as a multi-valued attribute.
Enabling this feature requires that the following Domino central configuration setting is set as follows:
• Key: authentication.oidc.externalOrgsEnabled
Value: true
Remember that Domino services need to be restarted for this setting to take effect.
Attribute mapper
<saml2:Attribute Name="UserGroups">
<saml2:AttributeValue>nyc-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>all-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>sensitive-claims-users</saml2:AttributeValue>
</saml2:Attribute>
By default, the domino-group-mapper client mapper is created upon installation. To review it, go to the Clients
tab in the DominoRealm in Keycloak, and select the domino-play client:
The domino-group-mapper mapper will be present in the default client mappers listed:
Role synchronization
In addition to automatically configuring group membership, it is also possible to automatically assign Domino admin-
istrative and/or user roles to users based on attributes from your SAML identity provider.
Prerequisite
The SAML identity provider application connected to Domino must include attributes that can be mapped to specific
Domino roles.
Central configuration
Enabling this feature requires that the following Domino central configuration setting is set as follows:
• Key: authentication.oidc.externalRolesEnabled
Value: true
Remember that Domino services need to be restarted for this setting to take effect.
Attribute mapper
By default, the domino-system-roles client mapper is created upon installation. To review it, go to the Clients
tab in the DominoRealm in Keycloak and select the domino-play client.
The domino-system-roles mapper will be present in the default client mappers listed:
This section covers the SAML attributes expected by Domino to enable different pieces of functionality.
SSO attributes
The following are required to establish single sign-on between Domino and your identity provider:
• Username
– NameID (In Subject element)
– Preferred format: urn:oasis:names:tc:SAML:1.1:nameid-format:email
• First Name
– Attribute name: Can be any name since Domino allows attribute mapping
• Last Name
– Attribute name: Can be any name since Domino allows attribute mapping
• Email
– Attribute name: Can be any name since Domino allows attribute mapping
Example:
<saml2:Subject xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:email">
john.smith@acme.org
</saml2:NameID>
...
</saml2:Subject>
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoEmail">
<saml2:AttributeValue xsi:type="xs:string">
john.smith@acme.org
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="DominoFirstName">
<saml2:AttributeValue xsi:type="xs:string">
John
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="DominoLastName">
<saml2:AttributeValue xsi:type="xs:string">
Smith
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
The following attributes are optional, but are required if you are using the credential propagation functionality of Domino:
• AWS Roles
– Attribute Name: https://aws.amazon.com/SAML/Attributes/Role
– Multi-valued: Yes
– Value format: the ARN of the SAML provider and the ARN of the role, separated by a comma (as shown
in the example below)
• AWS Session Duration
– Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionDuration
– Value: the duration (in seconds) for which the initial set of credentials for each of the roles is valid
before the user must log in again. The duration must be smaller than the maximum allowable duration
for each of the roles made available to a given user.
Example:
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
<saml2:AttributeValue xsi:type="xs:string">
arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role1
</saml2:AttributeValue>
<saml2:AttributeValue xsi:type="xs:string">
arn:aws:iam::123456789012:saml-provider/acme-saml,arn:aws:iam::123456789012:role/role2
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/RoleSessionName">
<saml2:AttributeValue xsi:type="xs:string">
john.smith@acme.org
</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/SessionDuration">
<saml2:AttributeValue xsi:type="xs:string">
43200
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
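Each Role attribute value pairs a SAML provider ARN with a role ARN, separated by a comma. As an illustration of how such values can be consumed, the following sketch (hypothetical helper, not part of Domino) splits each value into its two ARNs:

```python
def parse_aws_role_values(values):
    """Split each AWS Role attribute value into (provider_arn, role_arn).

    Each value is expected to be a comma-separated pair, as in the SAML
    example above. Illustrative only; not Domino code.
    """
    pairs = []
    for value in values:
        provider_arn, role_arn = (part.strip() for part in value.split(","))
        pairs.append((provider_arn, role_arn))
    return pairs
```

For example, the first value in the assertion above yields the provider ARN and role1's ARN as a pair.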
The following additional attributes are required if you are using the group synchronization functionality in Domino:
• Domino Organizations
– Name: Can be any name since Domino can do attribute mapping
– Multi-valued: Yes
– Values:
* One or more of the groups of which the user is a member in your centralized identity provider. For
any groups specified here, the user will be automatically enrolled in a Domino organization with the
same name
Example:
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoOrganizations">
<saml2:AttributeValue>nyc-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>all-data-scientists</saml2:AttributeValue>
<saml2:AttributeValue>sensitive-claims-users</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
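As an illustration of how such a multi-valued attribute can be read, the following sketch uses Python's standard library to pull the organization values out of an AttributeStatement. This is illustrative only, not how Domino itself is implemented; the attribute name DominoOrganizations matches the example above.

```python
import xml.etree.ElementTree as ET

# SAML 2.0 assertion namespace, as used in the example above
SAML_NS = "urn:oasis:names:tc:SAML:2.0:assertion"

def extract_organizations(statement_xml, attr_name="DominoOrganizations"):
    """Collect the multi-valued organization attribute from an AttributeStatement."""
    root = ET.fromstring(statement_xml)
    values = []
    for attribute in root.iter(f"{{{SAML_NS}}}Attribute"):
        if attribute.get("Name") == attr_name:
            for value in attribute.iter(f"{{{SAML_NS}}}AttributeValue"):
                values.append(value.text.strip())
    return values
```

Applied to the example statement above, this returns the three group names, each of which would map to a Domino organization of the same name.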
The following additional attributes are required if you are using the administrative role synchronization functionality in Domino:
• Domino System Roles
– Name: Can be any name since Domino can do attribute mapping
– Multi-valued: Yes
– Values:
* One or more values that are an exact, case-sensitive match to one of the Domino administrative roles
· Practitioner
· SysAdmin
· Librarian
· ReadOnlySupportStaff
· SupportStaff
· ProjectManager
Example:
<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="DominoSystemRoles">
<saml2:AttributeValue xsi:type="xs:string">
SysAdmin
</saml2:AttributeValue>
<saml2:AttributeValue xsi:type="xs:string">
Librarian
</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
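Because matching is exact and case-sensitive, a value such as sysadmin would not map to any role. A sketch of the value check (illustrative only, not Domino code) using the six role names listed above:

```python
# The six Domino system roles listed above; matching is exact and case-sensitive.
DOMINO_SYSTEM_ROLES = {
    "Practitioner", "SysAdmin", "Librarian",
    "ReadOnlySupportStaff", "SupportStaff", "ProjectManager",
}

def matched_roles(attribute_values):
    """Return the attribute values that map to a Domino role, order preserved."""
    return [v for v in attribute_values if v in DOMINO_SYSTEM_ROLES]
```

For the example assertion above, both SysAdmin and Librarian would be matched and assigned.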
Operations
This section contains information for IT and site reliability operations on how to measure, understand, and manage the
health of a deployed Domino application.
Domino runs in Kubernetes, which is an orchestration framework for containerized applications. In this model there
are three distinct layers with their own relevant metrics:
1. Domino application
This is the top layer, representing Domino application components running in containers that are deployed and
managed by Kubernetes. The content in this guide focuses on operations in this layer.
2. Kubernetes cluster
This is the Kubernetes software-defined hardware abstraction and orchestration system that manages the
deployment and lifecycle of Domino application components. Cluster operations are handled a layer below Domino,
but do need to take into account the Domino architecture and cluster requirements. For guidance on general
cluster administration, consult the official Kubernetes documentation.
3. Host infrastructure
This is the bottom layer, representing the virtual or physical host machines that are doing work as nodes in the
Kubernetes cluster. Operations in this layer, including management of computing and storage resources as well
as OS patching, are the responsibility of the IT owners of the infrastructure. Domino does not have any unique
or unusual requirements in this layer.
Domino Admin Docs Documentation, Release 4.4.0
Execution logs are the logs output by user code running in Domino as a Job, Workspace, App, or Model API. These are
available in the Domino web application on the Jobs Dashboard, Workspaces Dashboard, App Dashboard, and Model
API instance logs. This data is a key part of the Domino reproducibility model, and is kept indefinitely in the Domino
blob store.
The system these logs are written to is defined in the installation configuration file at blob_storage.logs.
All Domino services output their logs using the standard Kubernetes logging architecture. Relevant logs are printed to
stdout or stderr as indicated, and are captured by Kubernetes.
For example, to look at your front end logs you could do the following:
1. List all namespaces to find the name of your platform namespace:
kubectl get namespaces
2. List all the pods in your platform namespace to find the name of a front end pod. Keep in mind you likely have
more than one front end pod:
kubectl get pods -n <platform namespace>
3. Print the logs for one of your front end pods:
kubectl logs <front end pod name> -n <platform namespace> -c nucleus-frontend
The most effective way to aggregate logs is to attach a Kubernetes log aggregation utility to monitor the following
Kubernetes namespaces used by Domino:
• Platform namespace
This namespace hosts the core application components of the Domino application, including API servers,
databases, and web interfaces. The name of this namespace is defined in the installer configuration file at
namespaces.platform.name.
The following components running in this namespace produce the most important logs:
• nucleus-frontend
The nucleus-frontend pods host the frontend API server that routes all requests to the Domino application.
Its logs contain details on HTTP requests to Domino from the application or another API client. If you see
errors in Domino with HTTP error codes like 500, 504, or 401, you can find corresponding logs here.
• nucleus-dispatcher
The nucleus-dispatcher pod hosts the Domino scheduling and brokering service that sends user execution
pods to Kubernetes for deployment. Errors in communication between Domino and Kubernetes will result
in corresponding logs from this service.
• keycloak
The keycloak pods host the Domino authentication service. The logs for this service will contain a record
of authentication events, including additional details on any errors.
• cluster-autoscaler
This pod hosts the open-source Kubernetes cluster autoscaler, which controls and manages autoscaling
resources. The logs for this service will contain records of scaling events, both scaling up new nodes in
response to demand and scaling down idle resources, including additional details on any errors.
Monitoring Domino involves tracking several key application metrics. These metrics reveal the health of the
application and can provide advance warning of any issues or failures of Domino components.
8.2.1 Metrics
There are many application monitoring tools you can use to track these metrics, including:
• NewRelic
• Splunk
• Datadog
8.2.2 Alerting
Users are advised to configure alerts that notify application administrators when the thresholds listed above are
exceeded. These alerts indicate potential resourcing issues or unusual usage patterns worth investigating. Refer to
the Domino application logs, the Domino administration UI, and the Domino Control Center to gather additional
information.
Domino runs in Kubernetes, which is an orchestration framework for delivering applications to a distributed compute
cluster. The Domino application runs two types of workloads in Kubernetes, and there are different principles to sizing
infrastructure for each:
• Domino Platform
These always-on components provide user interfaces, the Domino API server, orchestration, metadata, and
supporting services. The standard architecture runs the platform on a stable set of three nodes for high availability,
and the capabilities of the platform are principally managed through vertical scaling, which means changing
the CPU and memory resources available on those platform nodes and changing the resources requested by the
platform components.
• Domino Compute
These on-demand components run users’ data science, engineering, and machine learning workflows. Compute
workloads run on customizable collections of nodes organized into node pools. The number of these nodes can
be variable and elastic, and the capabilities are principally managed through horizontal scaling, which means
changing the number of nodes. However, when there are more resources present on compute nodes, they can
handle additional workloads, and therefore there are benefits to vertical scaling.
The resources available to the Domino Platform will determine how much concurrent work the application can handle.
This is the primary capability of Domino that is limited by vertical scale. To increase the capacity, key components
must have access to additional CPU and memory.
The default size for the Domino Platform is three nodes, with 8 CPU cores and 32GB memory each, for a total of
24 CPU cores and 96GB of memory. Those resources are available to the collective of Platform services, and each
service claims some resources via Kubernetes resource requests.
The capabilities of that default size are shown below, along with options for alternative sizing.
Domino recommends assuming a baseline maximum number of workloads equal to 50% of the total number of
Domino users, expressed as a concurrency of 50%. However, different teams and organizations may have different
usage patterns in Domino. For teams that regularly run batches of many executions at once, it may be necessary
to size Domino to support a concurrency of 100%, or even 200%.
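The sizing rule of thumb above can be sketched as simple arithmetic. The function names below are illustrative, not part of Domino; the default figures match the three-node, 8-core/32GB Platform described earlier.

```python
def baseline_concurrent_workloads(total_users, concurrency=0.5):
    """Workloads to size for: total users times the assumed concurrency ratio."""
    return int(total_users * concurrency)

def default_platform_capacity(nodes=3, cores_per_node=8, memory_gb_per_node=32):
    """Aggregate (CPU cores, memory GB) of the default three-node Platform."""
    return nodes * cores_per_node, nodes * memory_gb_per_node
```

For example, a 100-user deployment at the default 50% concurrency would be sized for 50 concurrent workloads; a batch-heavy team assuming 200% would plan for 200.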
The following practices can maximize the capabilities of a Platform with a given size.
• Cache frequently used Domino environments in the AMI used for your Compute Nodes. This reduces load on
the Platform Docker registry.
• Optimize your hardware tiers and node sizes to fit many workloads in tidy groups. Each additional node runs
message brokers and logging agents, and adds load to Platform services that process queues from the Compute
Grid. The Platform can handle more concurrent executions by running more executions on fewer nodes.
• Parallelize your tasks by running your workload on many cores of one large node, rather than by chunking tasks
into multiple workloads across multiple nodes. This reduces the total number of nodes being managed, and
thereby reduces load on the Domino platform.
Data management
• Overview
• About Domino project files
– How is the data in project files stored?
– Who can access the data in project files?
• About Domino Datasets
– How is the data in Domino Datasets stored?
– Who can access the data in Domino Datasets?
• Integrating Domino with other data stores and databases
• Tracking and auditing data interactions in Domino
9.1.1 Overview
This article describes how Domino stores and handles data that users upload, import, or create in Domino. There are
two systems that store data in Domino:
Work in Domino happens in projects. Every Domino project has a corresponding collection of project files. While at
rest, project files are stored in a durable object storage system, referred to as the Domino Blob Store. This can be a
cloud service like Amazon S3, or it can be an on-premises Network Attached Storage (NAS) system.
When a user starts a Run in Domino, the files from their project are fetched from the Blob Store and loaded
into the Run in the working directory of the Domino service filesystem. When the Run finishes, or the user initiates
a manual sync in an interactive Workspace session, any changes to the contents of the working directory are written
back to Domino as a new revision of the project files. Domino’s versioning system tracks file-level changes and can
provide rich file difference information between revisions.
Domino also has several features that provide users with easy paths to quickly initiating a file sync. The following
events in Domino can trigger a file sync, and the subsequent creation of a new revision of a project’s files.
• User uploads files from the Domino web application upload interface
• User authors or edits a file in the Domino web application file editor
• User syncs their local files to Domino from the Domino Command Line Interface
• User uploads files to Domino via the Domino API
• User executes code in a Domino Job that writes files to the working directory
• User writes files to the working directory during an interactive Workspace session, and then initiates a manual
sync or chooses to commit those files when the session finishes
All revisions of project files that Domino creates are kept forever, since project files are a component in the Domino
Reproducibility Engine. It is always possible to return to and work with past revisions of project files.
While users are generally unable to permanently delete data from Domino project files, administrators do have the
capability to delete specific files by directly editing the contents of the blob store. This is an invasive process and not
recommended for day-to-day activity.
Users can read and write files in the projects they create, on which they are automatically granted an Owner role.
Owners can add collaborators to their projects with the following additional roles and associated file permissions.
• Contributor
Can read and write project files.
• Results Consumer
Can read project files.
• Launcher User
Cannot access project files.
• Project Importer
Can access files made available for export.
The permissions available to each role are described in more detail in Sharing and collaboration.
Users can also inherit roles from membership in Domino Organizations. Learn more in the Organizations overview.
Domino users with administrative roles are granted additional access to project files across the Domino deployment
they administer. Learn more in Admin roles.
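The file permissions described above can be summarized in a small lookup. This is a hypothetical encoding for illustration; the role names come from this page, but the permission strings and structure are invented, not a Domino API:

```python
# Hypothetical mapping of project roles to file permissions, per the list above.
PROJECT_FILE_PERMISSIONS = {
    "Owner": {"read", "write"},
    "Contributor": {"read", "write"},
    "Results Consumer": {"read"},
    "Launcher User": set(),              # cannot access project files
    "Project Importer": {"read-exported"},  # only files made available for export
}

def can_write_files(role):
    """True if the given project role may write project files."""
    return "write" in PROJECT_FILE_PERMISSIONS.get(role, set())
```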
When users have large quantities of data, including collections of many files and large individual files, Domino
recommends storing the data in a Domino Dataset. Datasets are collections of Snapshots, where each Snapshot is an
immutable image of a filesystem directory from the time when the Snapshot was created.
These directories are stored in a network filesystem like Amazon EFS or a local NFS, and can be attached to Domino
Runs for read-only use without transferring their contents into the Domino service filesystem. This allows users to
quickly start working on big data in Domino.
Each Snapshot of a Domino Dataset is an independent state, and its membership in a Dataset is an organizational
convenience for working on, sharing, and permissioning related data. Domino supports running scheduled Jobs that
create Snapshots, enabling users to write or import data into a Dataset as part of an ongoing pipeline.
Unlike project files, Dataset Snapshots can be permanently deleted by Domino system administrators. Snapshot
deletion is designed as a two-step process to avoid data loss, where users mark Snapshots they believe can be deleted,
and admins then confirm the deletion if appropriate. This permanent deletion capability makes Datasets the right
choice for storing data in Domino that has regulatory requirements for expiration.
Datasets in Domino belong to projects, and access is afforded accordingly to users who have been granted roles on
the containing project. Owners can mount Snapshots from Datasets in the project for read access, they can write new
Snapshots, and they can add collaborators with the following roles.
• Contributor
Can mount Datasets for read access and write new Snapshots.
• Results Consumer
Cannot read from Datasets or write new Snapshots.
• Launcher User
Cannot read from Datasets or write new Snapshots.
• Project Importer
Can mount Datasets for read access.
The permissions available to each role are described in more detail in Sharing and collaboration.
Users can also inherit roles from membership in Domino Organizations. Learn more in the Organizations overview.
Domino users with administrative roles are granted additional access to Datasets across the Domino deployment they
administer. Learn more in Admin roles.
Domino can be configured to connect to external data stores and databases. This process involves loading the
required client software and drivers for the external service into a Domino environment, and loading any credentials or
connection details into Domino environment variables. Users can then interact with the external service in their Runs.
Users can import data from the external service into their project files by writing the data to the working directory of
the Domino service filesystem, and they can write data from the external service to Dataset Snapshots. Alternatively, it
is possible to construct workflows in Domino that save no data to Domino itself, but instead pull data from an external
service, do work on the data, then push it to an external service.
Learn more in the Data sources overview and read our detailed Data source connection guides.
Domino system administrators can set up audit logs for user activity in the platform. These logs record events whenever
users:
• Create files
• Edit files
• Upload files
• View files
• Sync file changes from a Run
• Mount Dataset Snapshots
• Write Dataset Snapshots
This list is not exhaustive, and will expand as Domino adds new features and capabilities.
Domino administrators can contact support@dominodatalab.com for assistance enabling, accessing, and processing
these logs.
There are three ways for data to flow in and out of a Domino Run.
Each Domino Run takes place in a project, and the files for the active revision of the project are automatically loaded
into the local execution volume for a Job or Workspace according to the specifications of the Domino Service
Filesystem. These files are retrieved from the Domino File Store, and any changes to these files are written back to the
Domino File Store as a new revision of the project’s files.
Domino Runs may optionally be configured to mount Domino Datasets for input or output. Datasets are network
volumes mounted in the execution environment. Mounting an input Dataset allows for a Job or Workspace to both
start quickly and have access to large quantities of data, since the data is not transferred to the local execution volume
until user code performs read operations from the mounted volume. Any data written to an output Dataset is saved by
Domino as a new snapshot.
User code running in Domino can use third party drivers and packages to interact with any external databases, APIs,
and file systems that the Domino-hosting cluster can connect to. Users can read and write from these external systems,
and they can import data into Domino from such systems by saving files to their project or writing files to an output
Dataset.
The diagram below shows the series of operations that happens when a user starts a Job or Workspace in Domino, and
illustrates when and how various data systems can be used.
• Overview
• Setting up Kubernetes PV and PVC
• Registering external data volumes
• Viewing registered external data volume details
• Editing registered external data volumes
• Unregistering external data volumes
• Configuring censorship
9.3.1 Overview
You can access the External Data Volumes (EDV) administration screen by going to the Domino administration page
and navigating to External Data Volumes: Data -> External Volumes
External data volumes must be registered with Domino before they can be used. All registered external data volumes
appear in a standard table, which displays the EDV name, type, description, and volume access (see Volume Properties).
In addition, for each registered EDV, the Projects column indicates which projects have added the EDV.
Unless otherwise specified, all the following actions assume you are in the EDV administration page.
Note: We assume the setup of Kubernetes persistent volumes (PV) and persistent volume claims (PVC) is done by a
Kubernetes administrator.
Domino runs on a Kubernetes cluster, and EDVs must be backed by an underlying Kubernetes persistent volume (PV).
More importantly, that persistent volume must be bound to a properly labelled persistent volume claim (PVC). Here
is an example PV yaml file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 30Gi
  nfs:
    server: nfs.example.com   # example NFS server address
    path: /exported/share     # example export path
The creation of the PVC must include a label with the key dominodatalab.com/external-data-volume.
The value of that key represents the type of external data volume. Currently, NFS is the only supported value. Finally,
the PVC must be created in the Domino compute namespace. Here is an example PVC yaml file:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nfs
  namespace: domino-compute   # must be the Domino compute namespace
  labels:
    "dominodatalab.com/external-data-volume": "NFS"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  volumeName: pv-nfs
All properly labelled PVCs will be available candidates to register in the Domino EDV administration user interface.
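As a sketch of the registration requirement, assuming a PVC manifest has already been parsed into a Python dict (for instance with a YAML parser), a hypothetical candidate check might look like the following. The function is illustrative, not part of Domino:

```python
# Label key and supported value described in the text above.
REQUIRED_LABEL = "dominodatalab.com/external-data-volume"
SUPPORTED_VOLUME_TYPES = {"NFS"}

def is_edv_candidate(pvc_manifest):
    """True if a PVC manifest (parsed to a dict) carries the EDV label
    with a supported volume type, as required for registration."""
    labels = pvc_manifest.get("metadata", {}).get("labels", {})
    return labels.get(REQUIRED_LABEL) in SUPPORTED_VOLUME_TYPES
```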
To register an EDV with Domino, click the Register External Volume button on the upper right hand side of the EDV
administration page. This will open a modal with the EDV registration wizard. The wizard guides administrators
through registering the EDV by configuring various EDV properties (see Volume Properties).
1. Volume
The first step in the wizard is to select the volume type. Currently, NFS is the only supported volume type.
The Available Volumes list will show all candidate volumes of the selected type. The name of these volumes is
the name of the backing Kubernetes persistent volume claim (PVC).
2. Configuration
The second step in the wizard is to configure the volume.
• Name. (Required). This field defaults to the name of the selected PVC, but can be changed. A good practice
is to name the EDV so that users can recognize it based on the supporting use case or some organization-defined
convention.
• Mount Path. (Required). This specifies the relative mount path of the EDV for supported executions.
This field defaults to the name of the selected PVC, but can be changed. This field must be unique across
all registered EDVs. There are a few reserved words. See Volume Properties.
• Mount as read-only. This checkbox specifies the mount type, that is, whether the EDV is mounted as read-only
or read-write. The default is read-only (checked). Note that this is enforced at the Domino layer. More
restrictive access controls at the Kubernetes or NFS layer overrule this setting. For example, if the PVC
access mode is set to read-only, it does not matter that this field allows read-write; the underlying read-only
permission will be enforced.
• Description. Admin-defined description of the EDV.
3. Access
The third step in the wizard is to define the volume access. See Volume Properties and Authorization.
• Everyone. Allow EDV access to all logged-in users.
• Specific users or organizations. Limit EDV access to specific users and organizations.
Note: Regardless of the setting here, Domino Administrators (SysAdmin) will always be able to access any external
data volume.
To view a registered EDV's details, click on the Name of the EDV in the admin table.
To edit the details of a registered EDV, click on the vertical three dots on the right-hand side of its entry in the admin
EDV table. This will expose the Edit action. Click Edit to edit the EDV details.
A modal with editable fields appears where users can change EDV properties.
To unregister an EDV, click on the vertical three dots on the right-hand side of its entry in the admin EDV table. This
will expose the Unregister action. Click Unregister to unregister the EDV.
A confirmation modal appears where users can confirm the unregistration by clicking Unregister, or cancel out of the
operation altogether by clicking Cancel.
Multiple users collaborating on the same project may not all have the same level of volume access. EDVs added to the
project should not be accessible to users without volume access, and under no circumstance will a user without volume
access to an EDV be able to mount that EDV in a supported execution. However, Domino offers options to manage the
visibility of the EDV in the user interface with two levels of censorship. The levels of censorship allow administrators
to balance security and discoverability needs.
• Full censorship. Only the existence of inaccessible EDVs is made known to the user; the quantity and any
metadata (such as name or description) are not made known to the user. This is the level for those who want the
highest level of security.
• Inactive censorship. Inaccessible EDVs are made known to the user; the EDV metadata (such as name and
description) is made known to the user. This is the level that promotes discoverability. With discoverability,
users can escalate to Domino administrators to gain volume access. This is the default level of censorship.
The level of censorship is configured by a feature flag: ShortLived.ExternalDataVolumesFullCensor.
Key: ShortLived.ExternalDataVolumesFullCensor
Value: boolean
Default: false
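The two levels can be sketched as follows. This is an illustration of the behavior described above, not Domino's implementation; the field names are invented:

```python
def render_inaccessible_edv(edv, full_censor):
    """Sketch of the two censorship levels for an EDV the user cannot access.

    Full censorship hides all metadata; inactive censorship exposes name and
    description so the user can discover the volume and request access.
    """
    if full_censor:
        return None  # existence acknowledged only in aggregate, no metadata shown
    return {"name": edv["name"], "description": edv["description"], "accessible": False}
```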
• Overview
• Accessing the Datasets administration interface
• Monitoring Datasets usage
• Setting limits on Datasets usage
• Deleting Snapshots from Datasets
9.4.1 Overview
Domino administrators have four important responsibilities when managing Domino Datasets:
1. periodically check the Datasets administration interface
2. monitor and track storage consumption
3. set limits on usage per-Dataset
4. handle deletion of Dataset snapshots
To access the Datasets administration interface, click Admin from the Domino main menu to open the Admin home,
then click Advanced > Datasets.
The Datasets administration page shows important information about Datasets usage in your deployment. At the top
of the interface is a display that shows:
• total storage size used by all stored Snapshots
• the size of all storage used by Snapshots marked for deletion
Below that display is a table of all Snapshots from the history of the deployment. This table can be sorted by Snapshot
status, size, and the name of the containing Dataset.
There are two important central configuration options administrators can use to limit the growth of storage
consumption by Datasets.
Namespace: common
Key: com.cerebro.domino.dataset.quota.maxActiveSnapshotsPerDataset
Value: number
Default: 20
This option controls the maximum number of active Snapshots that may
be stored in a Dataset. Snapshots marked for deletion are not active
and do not count against this limit.
Namespace: common
Key: com.cerebro.domino.dataset.quota.maxStoredSnapshotsPerDataset
Value: number
Default: 20
This option controls the total number of Snapshots of any status that
may be stored in a Dataset.
If a Dataset reaches one of these limits, attempting to start a run with a Dataset configuration that could output a new
Snapshot will result in an error message. Before additional Snapshots can be written, you will need to delete old
snapshots or increase the limit.
Administrators can authorize individual projects to ignore these limits with an option in the Hardware & environment
tab of the project settings.
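A sketch of the quota check described above (illustrative only; Domino's actual enforcement lives in the application, and the limits come from the central configuration keys listed earlier):

```python
# Defaults of the two central configuration options described above.
MAX_ACTIVE_SNAPSHOTS = 20   # ...dataset.quota.maxActiveSnapshotsPerDataset
MAX_STORED_SNAPSHOTS = 20   # ...dataset.quota.maxStoredSnapshotsPerDataset

def may_write_snapshot(active_count, stored_count, exempt=False):
    """A run that could output a new Snapshot is rejected once either limit
    is reached, unless the project has been exempted by an administrator."""
    if exempt:
        return True
    return active_count < MAX_ACTIVE_SNAPSHOTS and stored_count < MAX_STORED_SNAPSHOTS
```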
Administrators can delete individual Snapshots at any time with the Delete button at the end of the row representing
the Snapshot in the Datasets administration UI. Clicking this button will open a confirmation dialog, and if you choose
to confirm, the Snapshot will be permanently deleted.
To avoid losing user data, Domino recommends following a two-step process for Snapshot deletion, where the user
who owns the Dataset marks a Snapshot for deletion, and then an administrator takes action to delete the Snapshot if
reasonable. Non-administrator users can never permanently delete Snapshots on their own.
From the Datasets administration UI, you’ll find a button you can click to Delete all marked snapshots, and you can
also sort the table of Snapshots by status to find and examine all Snapshots that have been marked for deletion.
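The recommended two-step flow can be sketched as a simple state machine. The names are illustrative, and the sketch models only the recommended path (admins can also delete directly, as noted above):

```python
class SnapshotState:
    ACTIVE = "active"
    MARKED = "marked_for_deletion"
    DELETED = "deleted"

def mark_for_deletion(state):
    """Step 1: the Dataset owner marks the Snapshot for deletion."""
    return SnapshotState.MARKED if state == SnapshotState.ACTIVE else state

def confirm_deletion(state, is_admin):
    """Step 2: only an administrator can permanently delete, and in the
    recommended flow only a Snapshot already marked for deletion."""
    if is_admin and state == SnapshotState.MARKED:
        return SnapshotState.DELETED
    return state
```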
Domino Data Lab is able to assist its customers in their obligations as data controllers under GDPR. This article covers
how to submit requests to Domino, and the information required from the customer to action the request. Because
Domino does not systematically access the contents of uploaded files, requests need to reference either specific users
or specific files.
1. User deletion. Domino is able to purge personal data covering the name, email address, and IP address of users of
Domino if required. To process a request, you will need to provide:
• The user account name
• A substitute user to inherit any owned projects and files
2. File versions request. Domino is able to provide the hash of all files in a version chain, and optionally access to
those files as well. To process a request, you will need to provide:
• A text file with the username, project name, and file path in this format:
username/file_path_1/file_name.csv.
3. List of projects referencing a file. Domino is able to provide the list of all projects which reference a specific file
version in Domino. This is useful to identify potential impacts of changing a source file or version. To process
a request, you will need to provide:
• A text file with the username, project name, and file path in this format:
username/file_path_1/file_name.csv.
4. File deletion or substitution request. To process a request, you will need to have provided:
• A text file with the username, project name, and file path needing deletion or substitution in this format:
username/file_path_1/file_name.csv.
• A text file with the username, project name, and file path to substitute (if applicable) in this format:
username/file_path_1/file_name.csv.
After any GDPR request, Domino will provide the customer evidence of the actions carried out but will no longer be
able to see the data in the system.
Most customers don’t encounter a need to have data deleted or returned through the course of doing business. Should
you need to for GDPR or other reasons, note that this may impact your history for reproducibility or auditability.
Domino does not accept responsibility for identifying data derived from files, nor for ensuring the stability of projects
or other work referencing a user or file altered in a request.
User management
10.1 Roles
Administrators of Domino can assign roles to users. These roles can be set manually via the UI, or they can be mapped
in from your identity provider if you have SSO integration enabled.
The available roles are:
• Practitioner
• SysAdmin
• ProjectManager
• Librarian
• SupportStaff
• ReadOnlySupportStaff
Users with no role are treated as LightUsers; they have restricted feature access and may have a different licensing status.
A SysAdmin user can grant access roles to other users. To do so, open the Users tab of the admin UI. Locate the user
you want to grant permissions to, click Edit next to the username, then select the desired role.
Users can have more than one role, and will have the additive permissions of each role.
By default, all new users will be assigned the Practitioner role, but this can be changed with central configuration
options.
Domino Admin Docs Documentation, Release 4.4.0
When Project Managers are members of organizations, their role grants them owner-level access to all projects that
are owned by other members of the organizations. This allows the Project Manager to see these projects and their
assets in the Projects Portfolio and Assets Portfolio.
Note that the Project Manager may also have the ability to add users to these organizations, thereby gaining contributor
access to those users’ projects. For this reason, Project Manager should be treated as a highly privileged role, similar
to System Administrator.
• Overview
• Tracking user license types
• Generating user activity reports
10.2.1 Overview
Administrators can use configurable thresholds to track user behavior across the platform for the purposes of identifying users who are taking up a Domino license. Users who access Domino only to consume data science products,
view results, and run Launchers are not counted as taking up a practitioner license.
Once a user performs a data science workflow like starting a Run or publishing a Model, the user will be considered a
practitioner for the purposes of licensing.
To view user information and identify users who are taking up a license, open the Admin interface by clicking Admin
at the bottom of the main menu, then click Users.
The same data on license types, practitioner workloads, and recent activity that is shown in the Users table is available
as a downloadable CSV report. To generate a report manually, from the Admin interface click Advanced > User
Activity Report.
These reports can also be delivered automatically by email; delivery is controlled by the following central configuration options:
Namespace: common
Key: com.cerebro.domino.Usage.ReportRecipients
Value: comma-separated list of email addresses to receive automated reports
Default: empty
Namespace: common
Key: com.cerebro.domino.Usage.RecentUsageDays
Value: number of days back to set as the threshold for recent activity
Default: 30
Namespace: common
Key: com.cerebro.domino.Usage.ReportFrequency
Value: cron string for how often to send usage reports
Default: 0 0 2 * * ? (daily at 02:00)
Environments
• Overview
• Best practices
• How to clean up your catalog of environments
– Look at current environment usage
– Plan changes to global environments
– Sunset deprecated and unused environments
11.1.1 Overview
This document covers best practices for compute environment management. As a Domino admin, you will have the
power and responsibility to curate the environments used by your organization. A proactive approach to environment
management can prevent sprawl, avoid repetition in environment creation, and equip users with the tools they need to
succeed in Domino.
As an admin, your objective is to find a balance between giving users the freedom to be agile in development, while
also maintaining enough control that you don’t end up with duplicate or unnecessary environments. Admins and users
are able to create an arbitrary number of environments in Domino. You’ll want to manage their creation so that you don’t
end up with dozens of global environments and hundreds of user environments, which can make it hard for users to
know which environments to use, and hard for you as an admin to maintain.
Don’t let your Domino look like this:
# This is a comment that could provide a helpful description of the code to follow
RUN echo 'This is an executed Dockerfile instruction'
Do yourself and future colleagues a favor by investing in a well-commented Dockerfile. Each section should have a
clear heading and comments to explain its purpose and implementation.
4. Share responsibility for environment management
If you have multiple teams or departments doing separate work in Domino, they should be responsible for
maintaining their own team-specific environments. Find an advanced user on each team and make them a
deputy for environment management. This person should be responsible for planning and understanding the
environments their team needs, and should work with you on implementation. This reduces the workload
of the admins, and ensures that environments are designed by someone with context on what users need.
5. Keep global images up-to-date and comprehensive
You should strive to have global images that cover the majority of users’ needs. Users should only need to make
minor additions to global environments when creating their own user environments, such as installing a specific
version of a package. You don’t want a situation where users are re-installing Python or making other major
changes, as this will result in a bloated and poorly performing environment.
6. Avoid time-consuming image pulls by caching global environments on the executor machine image
You should cache your global environments in your executor template machine image. This ensures that each
new executor starts up with the base Docker image for any environment already cached. If users are setting up
environments that have base images very different from what is cached on the machine image, it can lead to long
pull times when launching executors. Contact Domino Support for help with modifying your machine image.
7. Clean up old or poorly maintained environments
Create a culture of tidiness around environment creation and management. Enforce a standard of quality in nam-
ing and Dockerfile commenting, and be assertive about pruning unnecessary environments. See the following
section of this document for a walkthrough.
Over time, it’s inevitable that the number of environments in your Domino will grow. It’s valuable to do an occasional
review to weed out unused environments, update active ones, and consolidate where possible. Depending on the size
of your organization and your use of Domino, this may be a yearly or quarterly task.
As an admin, you can see all environments being used across your deployment in the Environments Overview. The
table of environments on this screen can be sorted by the # of Projects column to get a quick understanding of
which environments are in common use. You can also enter global in the search box to filter for global environments.
Click the name of an environment to see Dockerfile details, as well as the list of projects and models using that
environment. The list will also include the date of the last run for each project and model. Keep an eye out for
user environments that make duplicate changes to global base environments, as well as unused or poorly-maintained
environments.
Based on what you learned by reviewing existing environments, you should plan an updated set of global environments
that include the tools and features frequently added by users. In some cases, it might be as simple as adding a few
packages to an existing global environment. You can also create a new global environment when necessary, but we
recommend erring on the side of larger, more consolidated environments. Doing so will make it easier for your users
to choose an environment, and it will be easier for you to manage and maintain the collection of environments in your
deployment.
Before executing your plan and changing the available global environments, it’s best to inform your users of the
impending changes and solicit their feedback. Explain the changes to existing environments, announce the creation of
new ones, and provide recommendations for which environment to use for various types of projects.
As an admin, you have the power to archive any environment. All old projects will still be able to use an archived
environment, but new projects won’t be able to select it. Historical runs will still reference an archived environment,
so archiving never breaks reproducibility. Use archiving to encourage adoption of new, up-to-date, and consolidated
environments. Environments can be un-archived at any time.
When a user launches a Domino Run, part of the start-up process is loading the user’s environment onto the node that
will host the Run. For large images, the process of transferring the image to a new node can take several minutes.
After an image has been loaded onto a node, it is cached there, and future Runs that use the same environment will
start up faster.
When running Domino on EKS, you can pre-cache popular environments and base images on the Amazon Machine
Image (AMI) used for new nodes. This can speed up the start time of Runs on new nodes significantly. This page
describes the process of creating a new AMI with cached environments and configuring EKS to use it for new nodes.
In addition to any dependencies required by Kubernetes itself, your AMI should contain the following:
• Docker
• Cache of Domino’s compute environments
• Nvidia-Docker 2 (GPU nodes only)
• Nvidia GPU driver 410+ (GPU nodes only)
• Change the default docker runtime (GPU nodes only)
For simplicity, Domino recommends that you use the official EKS default AMIs, which come pre-configured with
Docker and the GPU tools.
• Click to read about the official EKS AMI Domino recommends for default compute nodes
• Click to read about the official EKS AMI Domino recommends for GPU nodes
Alternatively, you can use Amazon’s build scripts to create your own AMI for use with EKS.
The following sections describe how to perform several important types of operations on an EC2 instance to set it up
as the template for a new AMI suitable for Domino.
Install Docker
Pre-caching environment images is a simple process of running docker pull for the base images those environ-
ments are built on, or the built environments from the internal registry itself.
To pull the Domino Standard Environment base images, your command would look like this, substituting in the version
string for the image you want to cache.
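As a sketch, the pull command would look like the following. The repository path shown here is a placeholder assumption, not a verified value; substitute the actual Domino Standard Environment repository and version string.

```shell
# Placeholder image reference -- substitute the real repository and the
# version string for the Domino Standard Environment you want to cache.
docker pull quay.io/dominodatalab/base:<version-string>
```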
To pull a built image from the Domino internal registry, you will need to find its URI from the Revisions tab in the
environment details page.
For example, to cache revision #9 of the environment shown in the screenshot above, you would run:
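A sketch of that pull follows. The registry host and repository path here are hypothetical; copy the exact image URI shown on the environment's Revisions tab.

```shell
# Hypothetical URI -- use the image URI from the environment's Revisions
# tab for the revision (here #9) that you want to cache.
docker pull registry.<your-domino-url>/<environment-repository>:9
```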
Read the official instructions for installing the nvidia-docker 2.0 runtime.
To use the GPU on a GPU node, you need to install the appropriate driver on the machine image. Domino does not
require any specific driver version; however, if you want to use a Domino Standard Environment, it should be a
version that is compatible with the version of CUDA shown in standard environments.
Click to view a compatibility matrix.
If you’d like to install the GPU drivers manually, you can follow these instructions.
To validate that your GPU machine is configured properly, reboot the machine and run the following:
This will show the driver number and GPU devices if installed successfully.
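A minimal validation sketch, assuming the standard NVIDIA driver tooling is installed:

```shell
# Prints the driver version and a table of detected GPU devices
# when the driver is installed correctly.
nvidia-smi
```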
Read the official instructions from NVIDIA on using the container runtime.
Note that you must restart Docker before this will work.
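A common way to set nvidia as the default Docker runtime is via /etc/docker/daemon.json, followed by a Docker restart. This is a sketch assuming a systemd-based host with nvidia-container-runtime already installed; defer to the NVIDIA documentation linked above for the authoritative steps.

```shell
# Make the nvidia runtime the default for all containers, then restart
# Docker so the change takes effect.
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```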
1. Determine which AMI you want to use as the base for the new AMI. If you’re performing this operation on an
operational Domino node pool, you should use the AMI that’s currently used in the active launch configuration.
Once you’ve identified the name of the active launch configuration, view its details to see the AMI ID it uses.
Disaster recovery
The following systems are canonical stores of critical Domino data, and they are stored and backed up in AWS as
described below.
The following systems are canonical stores of critical Domino data, and they are stored and backed on-premises as
described below. These methods can also be applied to other clouds for which Domino does not have native storage
integrations.
Control Center
• Overview
• Who can access the Control Center?
• How do I open the Control Center?
• What metrics are available in the Control Center?
• Drilling down for more details
• Control center hardware tier page
• Control center project page
• Control center user page
13.1.1 Overview
The Control Center displays important data about your Domino deployment. From the Control Center, you can view
deployment-wide usage of compute resources by hours of runtime or spend in USD. You can also drill down into
detailed statistics on projects, hardware tiers, and users. The Control Center data is also available for export if you’d
like to create your own reports or analysis.
At this time, only Domino Admins can view the Control Center. The Control Center shows detailed deployment-wide
statistics and granular data on users and projects, and its functionality depends on the user having Admin permissions.
If you need access to the Control Center, contact your local Domino administrator or email support@dominodatalab.com.
If you have access to the Control Center, you’ll find a link to it in the Switch To menu.
When you first open the Control Center, you’ll see a bar chart of deployment compute spend in USD for each day in
the current month.
Compute spend is based on settings applied by admins when creating and managing hardware tiers. Compute spend
data will only be available in the Control Center if the “cents per minute” property is set on the hardware tier in use.
These numbers also only represent active usage, and do not reflect other potential spend like idle cloud resources or
storage.
You can change the date range shown with the dropdown menu in the upper right, and you can switch the chart to
display compute usage by hours of runtime with the dropdown in the upper left.
Below the deployment-wide chart is a panel that displays more granular data on projects, users, and hardware tiers
across the selected date range. You can chart these by the following metrics:
• Projects can be charted by compute spend (USD) or compute hours
• Users can be charted by compute spend (USD) or compute hours
• Hardware tiers can be charted by average run queue time in minutes
This chart will display the top five results for the chosen metric. When you have the chart set to display data on users,
there will also be a View all link you can use to load a paginated table with detailed usage statistics for all users.
By default this table will show data for the date range that was set on the previous page. There’s a dropdown menu in
the top right you can use to change the date range if desired.
Many of the tables and charts support drilling down for more detail on a specific item. Click one of the bars in
a Control Center bar chart to see an expanded and detailed page on the related project, user, or hardware tier.
Some of these pages will also display a table of related runs. You can click an entry in a Run Logs table to view the
specified run in the project Runs UI.
This page shows performance averages for runs that use the specified hardware tier, and tracks completed runs. Details
on all runs performed on the specified hardware tier are listed in the Run Logs table. Click an entry in the table to
view the specified run in the project Runs UI.
This page breaks down project spend across Apps, Batch Runs, Endpoints, Launchers, Scheduled Runs, and
Workspaces. All runs executed in the project are detailed in the Run Logs table. Click an entry in the table to
view the specified run in the project Runs UI.
This page shows detailed data on a user’s activity in Domino. The top of the page has charts showing the types of runs
this user starts, which projects the user works in, and which hardware tiers the user uses. You can click on bars in the
project and hardware tier charts to view the object represented. All runs started by this user are detailed in the Run
Logs table. Click an entry in the table to view the specified run in the project Runs UI.
The Control Center interface in Domino provides many different views on deployment usage, broken down by hard-
ware tier, project, or user. However, if you want to do a more detailed, custom analysis, it’s possible for Domino
administrators to use the API to export Control Center data for examination with Domino’s data science features or
external business intelligence applications.
The endpoint that serves this data is /v4/gateway/runs/getByBatchId.
Click through to read the REST documentation on this endpoint, or see below for a detailed description plus examples.
To make an API call, you’ll need the API key for your account. In this case, accessing the full deployment’s Control
Center data requires that you use an admin account. Once you’re logged in as an admin, click your username at bottom
left, then click Account Settings.
Click API Key from the settings menu to link down to the API Key panel. Copy the displayed key and keep it handy.
You’ll need it to make requests to the API.
Note that anyone bearing this key could authenticate to the Domino API as you. Treat it like a sensitive password.
Here’s a basic call to the data export endpoint, executed with cURL:
curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId'
By default, the endpoint starts with the oldest available run data, beginning from January 1st, 2018. Older data is not
available. The command also has a default limit of 1000 runs worth of data. As written, the call above will return data
on the oldest 1000 runs available.
To try out this example, fill in <your-api-key> and <your-domino-url> in the command above.
The standard JSON response object you receive will have the following schema:
{
"runs": [
{
"batchId": "string",
"runId": "string",
"title": "string",
"command": "string",
"status": "string",
"runType": "string",
"userName": "string",
"userId": "string",
"projectOwnerName": "string",
"projectOwnerId": "string",
"projectName": "string",
"projectId": "string",
"runDurationSec": 0,
"hardwareTier": "string",
"hardwareTierCostCurrency": "string",
"hardwareTierCostAmount": 0,
"queuedTime": 0,
"startTime": 0,
"endTime": 0,
"totalCostCurrency": "string",
"totalCostAmount": 0
}
],
"nextBatchId": "string"
}
Each run recorded by the Control Center gets a batchId, an incrementing field that can be used as a
cursor to fetch data in multiple batches. As you can see in the response above, after the array of run objects there is a
nextBatchId parameter that points to the next run that would have been included.
You can use that ID as a query parameter in a subsequent request to get the next batch:
curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId?batchId=<your-batchId-here>'
You can also request the data as CSV by including a header with Accept: text/csv. On the Unix shell, you
can write the response to a file with the > operator. This is a quick way to get data suitable for import into analysis
tools:
curl --include \
-H "X-Domino-Api-Key: <your-api-key>" \
-H 'Accept: text/csv' \
'https://<your-domino-url>/v4/gateway/runs/getByBatchId' > your_file.csv
The code below shows a simple Python script that fetches all Control Center data from the earliest available to a
configurable end date, and writes it to a CSV file. Fill in the date of the last known completed run to fetch all available
historical data.
import os

import pandas as pd
import requests

URL = "https://<your-domino-url>/v4/gateway/runs/getByBatchId"
headers = {"X-Domino-Api-Key": "<your-api-key>"}
last_date = "YYYY-MM-DD"  # date of the last known completed run

# Start fresh: remove any previous export
try:
    os.remove("output.csv")
except OSError:
    pass

batch_id_param = ""
first_batch = True
while True:
    response = requests.get(url=URL + batch_id_param, headers=headers)
    parsed = response.json()
    runs = parsed.get("runs", [])
    if not runs:
        break
    df = pd.DataFrame(runs)
    # endTime is an epoch timestamp in milliseconds; keep runs up to last_date
    df = df[pd.to_datetime(df.endTime, unit="ms") <= pd.Timestamp(last_date)]
    # Append to the CSV, writing the header row only once
    df.to_csv("output.csv", mode="a", index=False, header=first_batch)
    first_batch = False
    next_batch_id = parsed.get("nextBatchId")
    if not next_batch_id:
        break
    batch_id_param = "?batchId=" + next_batch_id
Running a script like this periodically allows you to easily import fresh data into your tools for custom analysis. You
can work with the data in a Domino project, or make it available to third-party tools like Tableau.