
A COLLECTION OF POSTS ON MINIO AND KUBERNETES

Table of Contents

Announcing the MinIO Kubernetes Operator and Operator Console
Simplifying Multi-Tenant Object Storage as a Service with Kubernetes and MinIO Operator
Why Kubernetes Managed Object Storage Matters
CI/CD Deploy with MinIO distributed cluster on Kubernetes
MinIO as Helm Chart Repository
Building an ML Data Pipeline with MinIO and Kubeflow v2.0
How to Set up Kafka and Stream Data to MinIO in Kubernetes
Dremio and MinIO on Kubernetes for Fast Scalable Analytics
Simplifying Multi-Cloud Kubernetes with MinIO and Rafay
Spark, MinIO and Kubernetes
About MinIO



Announcing the MinIO Kubernetes Operator and
Operator Console
Daniel Valdivia 7 April 2021

Object-storage-as-a-service is a game changer for IT.

For the better part of a decade, IT has watched as developers provisioned object storage for emerging
applications on the public cloud - driving much of the adoption of this medium.

This creates many well-known issues for IT. It is not simply a control issue; it is a broader and far more
critical governance issue with regard to security, compliance, budget and overall alignment.

The primary driver for developers turning to the public cloud was simply that IT couldn’t provision multi-tenant
object storage as a service. While IT was adept at archival object storage and was able to protect the crown
jewels when it came to data, they simply didn’t have the skill set to create, deploy, tune, scale and manage
modern, application oriented object storage using Kubernetes.

Kubernetes Native DNA

MinIO is purpose-built to take full advantage of the Kubernetes architecture. Created from scratch in the last
five years, MinIO has known nothing but containers and orchestration - it is simply how we think. As a result,
MinIO and Kubernetes work together to simplify infrastructure management, providing a way to manage
object storage infrastructure within the Kubernetes toolset.

The new Operator and Operator Console graphical user interface are an important evolution in our approach.
They solve a key problem for IT (getting them going on Kubernetes) while further simplifying object storage
for developers - without sacrificing granularity or control in the process.

The operator pattern extends Kubernetes' familiar declarative API model with custom resource definitions
(CRDs) to perform common operations such as resource orchestration, non-disruptive upgrades, cluster
expansion and maintaining high availability - operations that were previously handled in a Helm chart.

MinIO Kubernetes Operator

There are two components at play here: Operator and Operator Console.



First, there is the Operator. The Operator builds on the kubectl command set that the Kubernetes community is
already familiar with and adds the kubectl minio plugin. The MinIO Operator and the MinIO kubectl plugin
facilitate the deployment and management of MinIO Object Storage on Kubernetes - which is how
multi-tenant object storage as a service is delivered.

Examples include:

■ deploying an application on demand


■ taking and restoring backups of that application's state
■ handling upgrades of the application code alongside related changes such as database schemas or
extra configuration settings
■ publishing a Service to applications that don't support Kubernetes APIs to discover them
■ simulating failure in all or part of your cluster to test its resilience
■ choosing a leader for a distributed application without an internal member election process

The Operator is inherently a command line proposition, but merely providing an Operator wasn’t our goal.
MinIO goes further to simplify creation, deployment and management of Kubernetes native object storage
with a straightforward list of commands that make it easy to execute all of the key capabilities outlined above.
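
As a rough sketch of that workflow (the same commands and flags are covered in detail later in this collection), installing the plugin, initializing the Operator and creating a tenant looks something like this:

# Install the plugin, initialize the Operator, then create and access a tenant
kubectl krew install minio
kubectl minio init
kubectl minio tenant create minio-tenant-1 \
  --servers 4 \
  --volumes 16 \
  --capacity 16Ti \
  --namespace minio-tenant-1 \
  --storage-class standard
kubectl minio proxy    # prints a URL and JWT for the Operator Console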

The Operator Console makes Kubernetes object storage easier still. In this graphical user interface, MinIO created
something so simple that anyone in the organization can create, deploy and manage object storage as a service.



Regardless of your chosen interface, Operator or Operator Console, the functionality is effectively the same. The result is
an Operator experience that can be used to deploy MinIO on any Kubernetes distribution, be it OpenShift, vSphere 7.0U1,
Rancher or stock upstream.

The Tenant Mentality

The primary unit of managing MinIO on Kubernetes is the tenant. The best way to think about tenancy is to start with the
Kubernetes cluster. The MinIO Operator can allocate multiple tenants within the same Kubernetes cluster. Each tenant, in
turn, can have different capacity (e.g. a small 500GB tenant vs. a 100TB tenant), resources (1000m CPU and
4Gi RAM vs. 4000m CPU and 16Gi RAM) and servers (4 pods vs. 16 pods), as well as separate configurations
for Identity Providers, Encryption and versions.

In multi-tenant configurations, each tenant is a cluster of server pools (independent sets of nodes with their
own compute, network, and storage resources) that, while sharing the same physical infrastructure, are fully
isolated from each other in their own namespaces. Each tenant runs its own MinIO cluster, fully isolated
from other tenants, which protects it from disruptions caused by upgrades, updates or security incidents
elsewhere. Each tenant scales independently by federating clusters across geographies.

Since the server binary is fast and lightweight, MinIO's operator is able to densely co-locate several tenants
and use resources efficiently.

In the spirit of Kubernetes everywhere, MinIO runs on any public cloud provider such as Amazon's EKS (Elastic
Kubernetes Service), Google's GKE (Google Kubernetes Engine), Google's Anthos or Azure's AKS (Azure
Kubernetes Service).

Kubernetes Object Storage for Everyone

With the introduction of the Operator and the browser-based Operator Console, MinIO has delivered a
material upgrade to its already strong Kubernetes story. Now, without even knowing how to spell Kubernetes,
IT administrators can provision multi-tenant object storage as a service across hybrid cloud environments.

Get started and download MinIO! We have a tutorial, Simplifying Object Storage as a Service with Kubernetes
and MinIO's Operator, that can help you take the first steps. As always, if you have any questions, join our
Slack Channel or drop us a note at hello@min.io. We are here to help you - whichever interface option you
select.

Simplifying Multi-Tenant Object Storage as a Service


with Kubernetes and MinIO Operator
Daniel Valdivia 11 January 2022

This post was updated on 1.12.22.

Object storage as a service is the hottest concept in storage today. The reason is straightforward: object
storage is the storage class of the cloud and the ability to provision it seamlessly to applications or
developers makes it immensely valuable to enterprises of any size.

The challenge is that object storage as a service has traditionally been very difficult to deliver: overly
complex, hard to tune for performance and prone to failure at scale. While systems like Kubernetes offer
powerful tools for automating the deployment and management of these systems, the overall problem of
complexity remains unsolved as administrators must still invest significant time and effort to deploy even a
small scale object storage resource.

By combining Kubernetes with our new Operator and our Operator Console graphical user interface, MinIO is
changing that dynamic in a big way. It should be stated upfront that MinIO has always obsessed over
simplicity. It permeates everything we do, every design decision we make, every line of code we write.

Nonetheless, we saw even more opportunity for simplification. To do this we created the MinIO Operator and
the MinIO kubectl plugin to facilitate the deployment and management of MinIO Object Storage on
Kubernetes. While the Operator commands were critical for users already proficient with Kubernetes, we also
wanted to address a wider audience so we created a Graphical User Interface for the Operator and
incorporated it into our new MinIO Operator Console to enable anyone in the organization to create, deploy
and manage object storage as a service.

Kubernetes is the platform of the Internet. Given its massive adoption, we chose to remain consistent with the
Kubernetes way of doing things. This meant not using any specialized tools or services to set up MinIO.

The effect is that the MinIO Operator works on any Kubernetes distribution, be it OpenShift, vSphere 7.0U1,
Rancher or stock upstream. Further, MinIO will work on any public cloud provider such as Amazon's EKS
(Elastic Kubernetes Service), Google's GKE (Google Kubernetes Engine), Google's Anthos or Azure's AKS
(Azure Kubernetes Service).

Pretty much all you need to get started on any distribution of Kubernetes is some storage device that can be
presented to Kubernetes either via Local Persistent Volumes or with a CSI Driver.

Let’s start with a review on using MinIO with the kubectl plugin and a kustomize based approach. You'll need
to install the kubectl tool on a computer with network access to the Kubernetes cluster. See Install and Set Up
kubectl for installation instructions. You may need to contact your Kubernetes administrator for assistance in
configuring your kubectl installation for access to the Kubernetes cluster.

Installation

kubectl plugin

To install the MinIO Operator we can leverage its kubectl plugin, which can be installed via krew

kubectl krew install minio

After which we can install the Operator by simply doing

kubectl minio init

Installation with kustomize

Alternatively, for anyone who prefers a kustomize-based approach, our repository supports installing specific
tags. Of course, you can also use this as the base for your own kustomization.yaml file

kubectl apply -k github.com/minio/operator/resources/\?ref\=v4.4.3
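
For example, a minimal kustomization.yaml that layers on the operator manifests might look like the following sketch (written here as a shell heredoc; the v4.4.3 tag matches the command above and any patches are up to you):

# Hypothetical kustomization.yaml using the operator repository as its base
cat > kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- github.com/minio/operator/resources/?ref=v4.4.3
EOF
kubectl apply -k .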

Provisioning Object Storage

The analogy we use to represent a MinIO Object Storage cluster is the Tenant. We did this to communicate that
with the MinIO Operator one can allocate multiple Tenants within the same Kubernetes cluster. Each tenant, in
turn, can have different capacity (e.g. a small 500GB tenant vs. a 100TB tenant), resources (1000m CPU and
4Gi RAM vs. 4000m CPU and 16Gi RAM) and servers (4 pods vs. 16 pods), as well as separate configurations
for Identity Providers, Encryption and versions.

Let's start by creating a small tenant with 16Ti capacity across 4 nodes. We will first create a namespace for
the tenant to be installed called `minio-tenant-1` and then place the tenant there using the `kubectl minio
tenant create` command.



Pay close attention to the storage class. Here we will use the cluster's default storage class - called standard -
but you should use whatever storage class can accommodate 16Ti (i.e. sixteen 1Ti persistent volumes).

kubectl create ns minio-tenant-1

kubectl minio tenant create minio-tenant-1 \
  --servers 4 \
  --volumes 16 \
  --capacity 16Ti \
  --namespace minio-tenant-1 \
  --storage-class standard

This command will output the credentials needed to connect to this tenant. MinIO only displays these
credentials once, so make sure you copy them to a secure location.

Tenant 'minio-tenant-1' created in 'minio-tenant-1' Namespace


Username: admin
Password: dbc978c2-bfbe-41bf-9dc6-699c76bafcd0

Note: Copy the credentials to a secure location. MinIO will not display these

again
+-------------+------------------------+------------------+--------------+-----------------+
| APPLICATION | SERVICE NAME | NAMESPACE | SERVICE TYPE | SERVICE PORT(S) |
+-------------+------------------------+------------------+--------------+-----------------+
| MinIO | minio | minio-tenant-1 | ClusterIP | 443 |
| Console | minio-tenant-1-console | minio-tenant-1 | ClusterIP | 9090,9443 |
+-------------+------------------------+------------------+--------------+-----------------+

Usually a tenant takes a few minutes to provision while the MinIO Operator requests TLS certificates for MinIO
and the Operator Console via Kubernetes Certificate Signing Requests. You can check the progress by running:

kubectl get tenant -n minio-tenant-1

This will tell you your tenant’s current state:

➜ kubectl get tenants -n minio-tenant-1


NAME STATE AGE
minio-tenant-1 Waiting for MinIO TLS Certificate 19s



After a few minutes the tenant should report an Initialized state, indicating your Object Storage cluster is
ready:

➜ kubectl get tenants -n minio-tenant-1


NAME STATE AGE
minio-tenant-1 Initialized 3m21s

That's it - our Object Storage cluster is up and running, and we can access it via kubectl port-forward. To access
MinIO's Console:

➜ kubectl port-forward svc/minio-tenant-1-console 9443:9443 -n minio-tenant-1


Forwarding from 127.0.0.1:9443 -> 9443
Forwarding from [::1]:9443 -> 9443

And then go to https://localhost:9443/ in your local browser

Pretty easy right?

But now let's stop, rewind and remix to add a tenant using the MinIO Console for Operator (a.k.a. the Operator
UI). To access it we can simply run the kubectl minio proxy command, which will tell us how to access the
Operator UI.
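
For reference, that looks like the following (minio-operator is the namespace the kubectl minio plugin installs the Operator into by default):

# Forward the Operator Console locally and print the login JWT
kubectl minio proxy -n minio-operator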



As you can see, it tells you to visit http://localhost:9090/login in your local browser and also gives you
the JWT needed to access the Console UI.

Inside the Operator UI we can see the tenant that we provisioned previously using the kubectl plugin.

To add another one, hit Create Tenant. The first screen will ask a few configuration questions:



1. Name the tenant
2. Select a namespace
3. Select a storage class
4. Size your tenant

If you wish to configure an Identity Provider, TLS Certificates, Encryption or Resources for this tenant I invite
you to play with the Sections on the left where these configuration options reside.

In this screen you can size your tenant by number of servers, number of drives per server and desired
raw capacity. Additionally, you can get a preview of the usable capacity and the SLA guarantees with each
erasure coding parity value you pick.
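
As a rough back-of-the-envelope example of what that preview is doing (assuming MinIO's default EC:4 parity on a 16-drive tenant; the figures in the UI will also account for metadata overhead):

# 4 servers x 4 drives x 1Ti = 16Ti raw; with EC:4, 4 of every 16 shards are parity
echo "usable ~= $((16 * (16 - 4) / 16))Ti of 16Ti raw"   # => usable ~= 12Ti of 16Ti raw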

Now click Create. That is it.

Going back to the list of tenants we can see our original cli-provisioned tenant next to the tenant created
using the Operator UI. These processes are equivalent. It is only personal preference as to which you select.



Finally, if you are curious about how to provision a MinIO tenant via good old YAML, you can get the definition
of a tenant and get familiar with our Custom Resource Definition:

➜ kubectl get tenant bigdata-storage -o yaml

Which returns

apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: bigdata-storage
  namespace: default
spec:
  credsSecret:
    name: bigdata-storage-secret
  env:
  - name: MINIO_STORAGE_CLASS_STANDARD
    value: EC:8
  exposeServices:
    console: true
    minio: true
  image: minio/minio:RELEASE.2022-01-08T03-11-54Z
  imagePullSecret: { }
  log:
    audit:
      diskCapacityGB: 10
    image: minio/logsearch:v4.4.3
    resources: { }
  mountPath: /export
  pools:
  - affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: v1.min.io/tenant
              operator: In
              values:
              - bigdata-storage
            - key: v1.min.io/pool
              operator: In
              values:
              - pool-0
          topologyKey: kubernetes.io/hostname
    name: pool-0
    resources:
      limits:
        memory: 32Gi
      requests:
        memory: 2Gi
    servers: 4
    volumeClaimTemplate:
      metadata:
        name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: "68719476736"
        storageClassName: standard
    volumesPerServer: 4
  prometheus:
    diskCapacityGB: 5
    resources: { }
  requestAutoCert: true



Conclusion
We've gone to great lengths to simplify the deployment and management of MinIO on Kubernetes. It is simple
to install the Operator and use it to create tenants either by command line or by graphical user interface.
This, however, is just a subset of the features of MinIO on Kubernetes. Each MinIO Tenant has the full feature
set available with bare metal deployments - so you can migrate your existing MinIO deployments to
Kubernetes with full confidence in functionality.

I encourage you to try the MinIO Operator yourself and explore other cool features such as using the
Prometheus Metrics and Audit Log, or securing your MinIO Tenant with an external Identity Provider such as
LDAP/Active Directory or an OpenID provider.

No matter what approach you take, the ability to provision multi-tenant object storage as a service is now
within the skill set of a wide range of IT administrators, developers and architects.



Why Kubernetes Managed Object Storage Matters
Matt Sarrell 6 April 2021

We were talking with a well-respected industry analyst the other day and he challenged us to articulate why
Kubernetes is so important to Object Storage. It got us thinking that this was a topic worthy of our time, and
yours.

At the most basic level, the value of Kubernetes lies in its ability to treat infrastructure as code, delivering full
scale automation to both stateful and stateless components of the software stack.

To derive the maximum amount of value requires treating the maximum number of components as code and
orchestrating those. That means you put EVERYTHING into the container, including applications,
infrastructure and data.

In the modern world, applications are stateless and containerized. Still, application state has to be held somewhere.
That somewhere is object storage (not legacy block and file) and that object storage needs to run IN the
container. When done this way Kubernetes can manage the automation of the infrastructure - both stateful
and stateless.

If the object store is left to bare metal or public cloud storage services, the benefits of Kubernetes based
infrastructure orchestration are considerably diminished.

Another way to think about it is through a VMware analogy. VMware created the concept of the software
defined datacenter. This was a predecessor to Kubernetes (which is why they claim it as their birthright). To
get the true value of SDDC, you have to virtualize the entire datacenter. If some of the applications are left
behind to run on bare metal, SDDC benefits are lost.

The same is true for Kubernetes. If you only use Kubernetes for the applications, you are only tapping a
fractional amount of the value. Let’s explore this a little deeper.

First off, in the modern model, CPU, Network and Storage are physical layers to be abstracted by Kubernetes.
They have to be abstracted so that applications and data stores can run as containers anywhere. In particular,
the data stores include all persistent services (databases, message queues, object stores..).

From the Kubernetes perspective, object stores are not different from any other key value stores or
databases. The storage layer is reduced to physical or virtual drives underneath. The need to run persistent
data stores as containers arises from hybrid cloud portability. Leaving essential services to external physical
appliances or the public cloud takes away the benefits of Kubernetes automation.

This VMware post announcing the reason they built the Data Persistence platform is an excellent resource.
DPp is the answer to the question “how can we allow modern applications to do what they do best, but still
provide the ease of use and transparent operations of the VMware platform to admins and developers?”

Modern applications, in particular, those built to run on Kubernetes, are designed to take care of availability,
replication, scaling and encryption within themselves to become completely independent of the
infrastructure. In turn storage needs to run IN the container in order to deliver Observability, Data Placement,
Maintenance Operations, and Failure Handling.

This was not always the case. Traditionally, applications relied on databases to store and work with structured
data, and storage, such as local drives or distributed file systems, to house all of their unstructured and even
semi-structured data. However, the rapid rise in unstructured data challenged this model. As developers
quickly learned, POSIX was too chatty, had too much overhead to allow the application to perform at scale
and was confined to the data center as it was never meant to provide access across regions and continents.

This led them to object storage, which is designed for RESTful APIs (as pioneered by AWS S3). Now
applications were free of any burden to handle local storage, making them effectively stateless (as the state
is with the remote storage system).

Modern applications are built ground up with this expectation. Well-designed modern applications that deal
with some kind of data (logs, metadata, blobs, etc), conform to the cloud-native (RESTful API) design
principle by saving the state to a relevant storage system.

As a quick side note, REST APIs only address application-storage communication challenges such as PUT and
GET or READ/WRITE data, and tracking metadata and version data, but not container orchestration and
automation. That requires Kubernetes.

SAN and NAS can also make application containers stateless - but POSIX based File and Block are hopelessly
inflexible in a containerized environment - i.e. ability to have application workers grow and shrink based on
inbound load, move to a new node as soon as a current node goes down and so on. This is why object
storage has replaced them as the primary storage class - as evidenced by public cloud’s reliance on object
storage (and pricing of block and file).

This is not to say that storage applications, e.g. databases, object stores, key value stores, must be stateless.
On the contrary, they need to be stateful - they just shouldn’t have the effect of making the application
stateful in the process.

Why MinIO

Kubernetes native storage applications (like MinIO) are designed to leverage the flexibility containers bring.
Agile and DevOps best practices dictate that applications and CI/CD processes be simple and
straightforward, independent of underlying infrastructure and consistent in how they access underlying
infrastructure. Simply put, containers need to run the same way everywhere in order to be portable across
development, test, and production. Combining that with variable hardware infrastructures, it makes sense for
Kubernetes to be the point of contact between all the disaggregated infrastructures, applications and data
stores.

Therefore, storage applications cannot make assumptions about the environment in which they are deployed.
For example, MinIO uses an internal erasure coding mechanism to ensure there is adequate redundancy in the
system, across varying hardware and cloud infrastructures, to allow up to half of the drives to fail. MinIO also
manages the data integrity and security using its own hashing and server side encryption.

No application should have to do any of that for itself anymore.

In the Kubernetes world, functions are simplified and abstracted: applications do application things and
storage does storage things. The application doesn’t have to think about it - it just happens, all inside a
container that can be expanded, moved or wiped out.

This is the cloud-native way.

There are certainly non-cloud native ways. For example, you could solve this problem with the Container
Storage Interface (CSI), but sophisticated architects and developers don’t, because it adds needless
complexity and scalability challenges. This is because CSI-based PVs bring their own management and
redundancy layers which generally compete with the stateful application’s design.

Take the following example of how cloud-native platforms work with storage and state. Apache Spark, in the
cloud-native world, runs in a stateless manner on Kubernetes and ships state to other systems while Spark
containers themselves are running completely stateless. Other major enterprise players in the big data
analytics space like Vertica, Teradata, Greenplum are also moving to a disaggregated model of compute and
storage.

Similarly, all the other major analytics platforms from Presto, Tensorflow to R, Jupyter notebooks follow such
patterns. Offloading state to remote cloud storage systems makes your application much easier to scale and
manage. Additionally, it helps keep the application portable to different environments.



MinIO has always thought of storage in this context. A majority of our workloads (523M Docker pulls as of
this morning) run in containers (64%) and almost half are managed by Kubernetes (42%). That is why VMware
picked us as a design partner for the launch of their Data Persistence platform (DPp). We are the standard for
this type of deployment.

We continue to refine our approach. For example, our widely adopted Helm chart approach was not enough
to cross the chasm from our DevOps audience to the mainstream IT administrator audience. Our previous
implementation effectively dealt with a single tenant. For multi-tenancy and other DevOps tasks like
provisioning, scaling, upgrades/updates, monitoring and encryption services - this required custom code.

Our new Kubernetes Operator helps our clients cross the chasm. Building a multi-tenant, self-service object
storage infrastructure on top of MinIO required a significant amount of skills and custom code development.

With the introduction of the Operator, such tasks are automated and API / Web driven. Now MinIO is a full
blown multi-tenant, self-service cloud storage on top of Kubernetes. The Operator and Console put the power
of Kubernetes-native, object-storage-as-a-service into the hands of IT - without requiring CLI or scripting
skills.

MinIO Everywhere

When we started talking about the concept of #minioeverywhere it was to illustrate our integrations with the
cloud-native elite. Now, however, #minioeverywhere speaks to the fact that MinIO, in conjunction with
Kubernetes, runs everywhere.

This can be lost on some given its nuance. Because of key economic and technical hurdles among the public
cloud providers, it is increasingly attractive to use MinIO/Kubernetes across all infrastructures.

For example, public clouds are not interchangeable. AWS S3 does not equal Blob (Azure) and certainly does
not equal GCS (which is only marginally S3 compatible). Also, in the public cloud, bandwidth is more expensive than
storage and latency is high. Smoothing these differences is a very expensive proposition.

Enterprises are adopting MinIO as a core part of their software stack (applications AND storage) because
they can roll it anywhere. AWS, GCP, Azure, Tanzu, Openshift - the list goes on. Because MinIO is Kubernetes
native and runs IN the container - MinIO works out of the box in any Kubernetes environment - from a car or
5G POP to the public cloud. That is why you find 7.7M IPs running MinIO in AWS, GCP and Azure.

All Together Now

There is a lot here so let’s summarize quickly. Kubernetes' value lies in its ability to treat infrastructure as
code, delivering full scale automation to both stateful and stateless components of the software stack.

The value of Kubernetes is only achieved if you can get the maximum number of components inside the
container. This includes storage/persistent data.

MinIO is built for this - it easily fits in containers (~45MB), it is designed for RESTful APIs and continues to
evolve its approach (see MinIO Operator) to deliver the most native Kubernetes experience when it comes to
storage.

When you are native to Kubernetes you can run anywhere it does - and today, that is everywhere you care
about running - public cloud, private cloud, Kubernetes distribution and edge.

Don’t take our word for it. See for yourself. You can pull the MinIO Operator for Kubernetes code from Github.
Questions? Join the conversation on our Slack channel, or hit the Ask an Expert button and get started today.



CI/CD Deploy with MinIO distributed cluster on
Kubernetes
AJ 9 November 2022

Welcome to the third and final installment of our MinIO and CI/CD series. So far, we’ve discussed the basics of
CI/CD concepts and how to build MinIO artifacts and how to test them in development. In this blog post, we’ll
focus on Continuous Delivery and MinIO. We’ll show you how to deploy a MinIO cluster in a production
environment using infrastructure as code to ensure anyone can read the resources installed and apply version
control to any changes.

MinIO is very versatile and can be installed in almost any environment. MinIO supports multiple use cases,
allowing developers to work on a laptop with the same environment they use in production, applying the CI/CD
concepts and pipelines we discussed. We showed you previously how to install MinIO as a Docker container
and even as a systemd service. Today we’ll show you how to deploy MinIO in distributed mode in a production
Kubernetes cluster using an operator. We’ll use Terraform to deploy the infrastructure first, then we’ll deploy
the required MinIO resources.

MinIO Network

First we’ll use Terraform to build the basic network needed for our infrastructure to get up and running. We
are going to set up a VPC with 3 basic, commonly used network types. Within that network we’ll
launch a Kubernetes cluster where we can deploy our MinIO workloads. The structure of our Terraform
modules looks something like this:

modules
├── eks
│ ├── main.tf
│ ├── outputs.tf
│ └── variables.tf
└── vpc
├── main.tf
├── outputs.tf
└── variables.tf



https://github.com/minio/blog-assets/tree/main/ci-cd-deploy/terraform/aws/modules

In order for the VPC to host different networks, each subnet requires a unique, non-overlapping CIDR block.
These subnets are carved out of the VPC's address space. For a handful of subnets this is pretty easy to
calculate by hand, but for as many subnets as we have here, Terraform provides a handy function, cidrsubnet(),
to split the subnets for us based on a larger block we provide, in this case 10.0.0.0/16.
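
To see what that function is doing, you can experiment in terraform console; the values below are what cidrsubnet() returns for this base block and newbits value:

$ terraform console
> cidrsubnet("10.0.0.0/16", 4, 1)
"10.0.16.0/20"
> cidrsubnet("10.0.0.0/16", 4, 4)
"10.0.64.0/20"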

variable "minio_aws_vpc_cidr_block" {
description = "AWS VPC CIDR block"
type = string
default = "10.0.0.0/16"
}

variable "minio_aws_vpc_cidr_newbits" {
description = "AWS VPC CIDR new bits"
type = number
default =4
}

vpc/variables.tf#L1-L11

Define the VPC resource in Terraform. Any subnet created will be based on this VPC.

resource "aws_vpc" "minio_aws_vpc" {

cidr_block = var.minio_aws_vpc_cidr_block
instance_tenancy = "default"
enable_dns_hostnames = true

vpc/main.tf#L1-L7

Set up 3 different networks: Public, Private and Isolated.

The Public Network with Internet Gateway (IGW) will have inbound and outbound internet access with a
public IP and an Internet Gateway.



variable "minio_public_igw_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Public IGW subnets"

default = {
"us-east-1b" = 1
"us-east-1d" = 2
"us-east-1f" = 3
}
}

vpc/variables.tf#L15-L24

The aws_subnet resource will loop 3 times creating 3 subnets in the public VPC

resource "aws_subnet" "minio_aws_subnet_public_igw" {

for_each = var.minio_public_igw_cidr_blocks

vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits,
each.value)
availability_zone = each.key

map_public_ip_on_launch = true
}

resource "aws_route_table" "minio_aws_route_table_public_igw" {

vpc_id = aws_vpc.minio_aws_vpc.id

resource "aws_route_table_association" "minio_aws_route_table_association_public_igw" {

for_each = aws_subnet.minio_aws_subnet_public_igw

subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_public_igw.id
}

resource "aws_internet_gateway" "minio_aws_internet_gateway" {

A Collection of Posts on MinIO and Kubernetes | 21


vpc_id = aws_vpc.minio_aws_vpc.id

resource "aws_route" "minio_aws_route_public_igw" {


route_table_id = aws_route_table.minio_aws_route_table_public_igw.id
destination_cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.minio_aws_internet_gateway.id
}

vpc/main.tf#L11-L46

The Private Network with NAT Gateway (NGW) will have outbound network access, but no inbound network
access, with a private IP address and NAT Gateway.

variable "minio_private_ngw_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Private NGW subnets"

default = {
"us-east-1b" = 4
"us-east-1d" = 5
"us-east-1f" = 6
}
}

vpc/variables.tf#L26L-L35

The aws_subnet resource will loop 3 times creating 3 subnets in the private VPC

resource "aws_subnet" "minio_aws_subnet_private_isolated" {

for_each = var.minio_private_isolated_cidr_blocks

vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits,
each.value)
availability_zone = each.key
}

A Collection of Posts on MinIO and Kubernetes | 22


resource "aws_route_table" "minio_aws_route_table_private_isolated" {

vpc_id = aws_vpc.minio_aws_vpc.id

resource "aws_route_table_association" "minio_aws_route_table_association_private_isolated" {

for_each = aws_subnet.minio_aws_subnet_private_isolated

subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_private_isolated.id
}

vpc/main.tf#L50-L98

Finally, we create an Isolated and Air-gapped network with neither outbound nor inbound internet access.
This network is completely air gapped with only a private IP address.

variable "minio_private_isolated_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Private isolated subnets"

default = {
"us-east-1b" = 7
"us-east-1d" = 8
"us-east-1f" = 9
}
}

vpc/variables.tf#L37-L46

The aws_subnet resource will loop 3 times creating 3 subnets in the isolated/air-gapped VPC

resource "aws_subnet" "minio_aws_subnet_private_isolated" {

for_each = var.minio_private_isolated_cidr_blocks

vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits,
each.value)
availability_zone = each.key
}

A Collection of Posts on MinIO and Kubernetes | 23


resource "aws_route_table" "minio_aws_route_table_private_isolated" {

vpc_id = aws_vpc.minio_aws_vpc.id

resource "aws_route_table_association" "minio_aws_route_table_association_private_isolated" {

for_each = aws_subnet.minio_aws_subnet_private_isolated

subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_private_isolated.id
}

vpc/main.tf#L102-L123

MinIO Kubernetes Cluster

Create a Kubernetes cluster on which we’ll deploy our MinIO cluster. The
minio_aws_eks_cluster_subnet_ids will be provided by the VPC that we’ll create. Later, we’ll show how
to stitch all this together in the deployment phase.

variable "minio_aws_eks_cluster_subnet_ids" {
description = "AWS EKS Cluster subnet IDs"
type = list(string)
}

variable "minio_aws_eks_cluster_name" {
description = "AWS EKS Cluster name"
type = string
default = "minio_aws_eks_cluster"
}

variable "minio_aws_eks_cluster_endpoint_private_access" {
description = "AWS EKS Cluster endpoint private access"
type = bool
default = true
}

variable "minio_aws_eks_cluster_endpoint_public_access" {
description = "AWS EKS Cluster endpoint public access"
type = bool

A Collection of Posts on MinIO and Kubernetes | 24


default = true
}

variable "minio_aws_eks_cluster_public_access_cidrs" {
description = "AWS EKS Cluster public access cidrs"
type = list(string)
default = ["0.0.0.0/0"]
}

eks/variables.tf#L1-L28

Note: In production you probably don’t want to allow public access to the Kubernetes API endpoint, as it
opens up control of the cluster and could become a security issue.
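
One hedged way to lock that down with the variables this module already exposes is to override the default 0.0.0.0/0 in the terraform.tfvars we create later in this post (203.0.113.0/24 is just a placeholder for your own network):

# Restrict the EKS API endpoint to a known CIDR instead of 0.0.0.0/0
echo 'hello_minio_aws_eks_cluster_public_access_cidrs = ["203.0.113.0/24"]' >> hello_world/terraform.tfvars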

You will also need a couple of roles to ensure the Kubernetes cluster can communicate properly via the
networks we’ve created, and those are defined at eks/main.tf#L1-L29. The Kubernetes cluster definition is as
follows

resource "aws_eks_cluster" "minio_aws_eks_cluster" {


name = var.minio_aws_eks_cluster_name
role_arn = aws_iam_role.minio_aws_iam_role_eks_cluster.arn

vpc_config {
subnet_ids = var.minio_aws_eks_cluster_subnet_ids
endpoint_private_access = var.minio_aws_eks_cluster_endpoint_private_access
endpoint_public_access = var.minio_aws_eks_cluster_endpoint_public_access
public_access_cidrs = var.minio_aws_eks_cluster_public_access_cidrs
}

depends_on = [
aws_iam_role.minio_aws_iam_role_eks_cluster,
]

eks/main.tf#L31-L46

The cluster takes in the API requests made from commands like kubectl, but there’s more to it than that –
the workloads need to be scheduled somewhere. This is where a Kubernetes cluster node group is required.
Below, we define the node group name, the type of instance and the desired group size. Since we have 3 AZs,
we’ll create 3 nodes one for each of them.



variable "minio_aws_eks_node_group_name" {
description = "AWS EKS Node group name"
type = string
default = "minio_aws_eks_node_group"
}

variable "minio_aws_eks_node_group_instance_types" {
description = "AWS EKS Node group instance types"
type = list(string)
default = ["t3.large"]
}

variable "minio_aws_eks_node_group_desired_size" {
description = "AWS EKS Node group desired size"
type = number
default =3
}

variable "minio_aws_eks_node_group_max_size" {
description = "AWS EKS Node group max size"
type = number
default =5
}

variable "minio_aws_eks_node_group_min_size" {
description = "AWS EKS Node group min size"
type = number
default =1
}

eks/variables.tf#L30-L58

You need a couple of roles to ensure the Kubernetes node group can communicate properly, and those are
defined at eks/main.tf#L48-L81. The Kubernetes node group (workers) definition is as follows:

resource "aws_eks_node_group" "minio_aws_eks_node_group" {


cluster_name = aws_eks_cluster.minio_aws_eks_cluster.name
node_group_name = var.minio_aws_eks_node_group_name
node_role_arn = aws_iam_role.minio_aws_iam_role_eks_worker.arn
subnet_ids = var.minio_aws_eks_cluster_subnet_ids
instance_types = var.minio_aws_eks_node_group_instance_types

scaling_config {

A Collection of Posts on MinIO and Kubernetes | 26


desired_size = var.minio_aws_eks_node_group_desired_size
max_size = var.minio_aws_eks_node_group_max_size
min_size = var.minio_aws_eks_node_group_min_size
}

depends_on = [
aws_iam_role.minio_aws_iam_role_eks_worker,
]

eks/main.tf#L83-L100

This configuration will launch a control plane with worker nodes in any of the 3 VPC networks we configured.
We’ll show later the kubectl get no output once the cluster is launched.

MinIO Deployment

By now, we have all the necessary infrastructure in code form. Next, we’ll deploy these resources and create
the cluster on which we’ll deploy MinIO.

Install Terraform using the following command

brew install terraform

Install aws CLI using the following command

brew install awscli

Create an AWS IAM user with the following policy. Note the AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY after creating the user.



Set environmental variables for AWS, as they will be used by `terraform` and awscli.

$ export AWS_ACCESS_KEY_ID=<access_key>
$ export AWS_SECRET_ACCESS_KEY=<secret_key>

Create a folder called hello_world in the same directory as modules using the structure below

├── hello_world
│   ├── main.tf
│   ├── outputs.tf
│   ├── terraform.tfvars
│   └── variables.tf
├── modules
│   ├── eks
│   └── vpc

https://github.com/minio/blog-assets/tree/main/ci-cd-deploy/terraform/aws/hello_world



Create a file called terraform.tfvars and set the following variable

hello_minio_aws_region = "us-east-1"

Create a file called main.tf and initialize the terraform AWS provider and S3 backend. Note that the S3
bucket needs to exist beforehand. We are using the S3 backend to store the state so that it can be shared among
developers and CI/CD processes alike without trying to keep local state in sync across the org.
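
If the state bucket does not exist yet, something like the following would create it (bucket name taken from the example configuration below; versioning is optional but a common safeguard for Terraform state):

# Create the state bucket up front and enable versioning
aws s3 mb s3://aj-terraform-bucket --region us-east-1
aws s3api put-bucket-versioning --bucket aj-terraform-bucket \
  --versioning-configuration Status=Enabled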

terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.31.0"
    }
  }

  backend "s3" {
    bucket = "aj-terraform-bucket"
    key    = "tf/aj/mo"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.hello_minio_aws_region
}

hello_world/main.tf#L1-L21

Setting the backend bucket and key as variables is not supported, so those values need to be hard coded.

Call the VPC module from main.tf and name it hello_minio_aws_vpc



module "hello_minio_aws_vpc" {
source = "../modules/vpc"

minio_aws_vpc_cidr_block = var.hello_minio_aws_vpc_cidr_block
minio_aws_vpc_cidr_newbits = var.hello_minio_aws_vpc_cidr_newbits

minio_public_igw_cidr_blocks = var.hello_minio_public_igw_cidr_blocks
minio_private_ngw_cidr_blocks = var.hello_minio_private_ngw_cidr_blocks
minio_private_isolated_cidr_blocks = var.hello_minio_private_isolated_cidr_blocks

hello_world/main.tf#L23-L33

These are the variables required by vpc module

hello_minio_aws_vpc_cidr_block = "10.0.0.0/16"

hello_minio_aws_vpc_cidr_newbits = 4

hello_minio_public_igw_cidr_blocks = {
"us-east-1b" = 1
"us-east-1d" = 2
"us-east-1f" = 3
}

hello_minio_private_ngw_cidr_blocks = {
"us-east-1b" = 4
"us-east-1d" = 5
"us-east-1f" = 6
}

hello_minio_private_isolated_cidr_blocks = {
"us-east-1b" = 7
"us-east-1d" = 8
"us-east-1f" = 9
}

hello_world/terraform.tfvars#L3-L22



Once the VPC has been created, the next step is to create the Kubernetes cluster. The only value we will use
from the VPC creation is minio_aws_eks_cluster_subnet_ids. We’ll use the private subnets created by
the VPC

module "hello_minio_aws_eks_cluster" {
source = "../modules/eks"

minio_aws_eks_cluster_name = var.hello_minio_aws_eks_cluster_name
minio_aws_eks_cluster_endpoint_private_access =
var.hello_minio_aws_eks_cluster_endpoint_private_access
minio_aws_eks_cluster_endpoint_public_access = var.hello_minio_aws_eks_cluster_endpoint_public_access
minio_aws_eks_cluster_public_access_cidrs = var.hello_minio_aws_eks_cluster_public_access_cidrs
minio_aws_eks_cluster_subnet_ids =
values(module.hello_minio_aws_vpc.minio_aws_subnet_private_ngw_map)
minio_aws_eks_node_group_name = var.hello_minio_aws_eks_node_group_name
minio_aws_eks_node_group_instance_types = var.hello_minio_aws_eks_node_group_instance_types
minio_aws_eks_node_group_desired_size = var.hello_minio_aws_eks_node_group_desired_size
minio_aws_eks_node_group_max_size = var.hello_minio_aws_eks_node_group_max_size
minio_aws_eks_node_group_min_size = var.hello_minio_aws_eks_node_group_min_size

hello_world/main.tf#L37-L51

These are the variables required by EKS module

hello_minio_aws_eks_cluster_name = "hello_minio_aws_eks_cluster"
hello_minio_aws_eks_cluster_endpoint_private_access = true
hello_minio_aws_eks_cluster_endpoint_public_access = true
hello_minio_aws_eks_cluster_public_access_cidrs = ["0.0.0.0/0"]
hello_minio_aws_eks_node_group_name = "hello_minio_aws_eks_node_group"
hello_minio_aws_eks_node_group_instance_types = ["t3.large"]
hello_minio_aws_eks_node_group_desired_size = 3
hello_minio_aws_eks_node_group_max_size = 5
hello_minio_aws_eks_node_group_min_size = 1

hello_world/terraform.tfvars#L24-L32

Finally we’ll apply the configuration. While still in the hello_world directory run the following terraform
commands. This will take about 15-20 minutes to get the entire infrastructure up and running. Towards the
end, you should see an output similar to below:



$ terraform init

…TRUNCATED…

$ terraform apply

…TRUNCATED…

hello_minio_aws_eks_cluster_name = "hello_minio_aws_eks_cluster"
hello_minio_aws_eks_cluster_region = "us-east-1"

…TRUNCATED…

Finished: SUCCESS

Update your default kubeconfig to use the cluster we just created with the aws eks update-kubeconfig
command. The --region and --name values are available from the previous output.

$ aws eks --region us-east-1 update-kubeconfig \
    --name hello_minio_aws_eks_cluster

Check to verify that you can get a list of nodes

$ kubectl get no
NAME STATUS ROLES AGE VERSION
ip-10-0-105-186.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326
ip-10-0-75-92.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326
ip-10-0-94-57.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326

Next, install the EBS CSI driver so that gp2 PVCs can be mounted. We are using gp2 because it is the default
storage class supported by AWS.

Set credentials for the AWS secret using the same credentials used for awscli



kubectl create secret generic aws-secret \
--namespace kube-system \
--from-literal "key_id=${AWS_ACCESS_KEY_ID}" \
--from-literal "access_key=${AWS_SECRET_ACCESS_KEY}"

Apply the EBS drivers resources:

$ kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.12"

Your Kubernetes cluster should be ready now.

Now we’re ready to deploy MinIO. First, clone the MinIO repository

$ git clone https://github.com/minio/operator.git

Since this is AWS, we need to update the storageClassName to gp2. Open the following file and update any
references from storageClassName: standard to storageClassName: gp2. Each MinIO tenant has its
own tenant.yaml that contains the storageClassName configuration. Based on the tenant you are using, be
sure to update the storageClassName accordingly.

$ vim ./operator/examples/kustomization/base/tenant.yaml
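
If you'd rather script that change (for example in a CI job), a sed one-liner along these lines should work; this sketch assumes GNU sed and the tenant-lite example we apply next:

# In-place replace the storage class in the base tenant definition
sed -i 's/storageClassName: standard/storageClassName: gp2/' \
  operator/examples/kustomization/base/tenant.yaml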

Apply the resources to Kubernetes to install MinIO

$ kubectl apply -k operator/resources

$ kubectl apply -k operator/examples/kustomization/tenant-lite

Wait at least 5 minutes for the resources to come up, then verify that MinIO is up and running.

$ kubectl -n tenant-lite get po -o wide

NAME                                           READY   STATUS    RESTARTS      AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
storage-lite-log-0                             1/1     Running   0             17m   10.0.94.169   ip-10-0-94-57.ec2.internal     <none>           <none>
storage-lite-log-search-api-66f7db97f5-j268m   1/1     Running   3 (17m ago)   17m   10.0.93.40    ip-10-0-94-57.ec2.internal     <none>           <none>
storage-lite-pool-0-0                          1/1     Running   0             17m   10.0.88.36    ip-10-0-94-57.ec2.internal     <none>           <none>
storage-lite-pool-0-1                          1/1     Running   0             17m   10.0.104.48   ip-10-0-105-186.ec2.internal   <none>           <none>
storage-lite-pool-0-2                          1/1     Running   0             17m   10.0.71.81    ip-10-0-75-92.ec2.internal     <none>           <none>
storage-lite-pool-0-3                          1/1     Running   0             17m   10.0.94.183   ip-10-0-94-57.ec2.internal     <none>           <none>
storage-lite-prometheus-0                      2/2     Running   0             15m   10.0.85.181   ip-10-0-94-57.ec2.internal     <none>           <none>

If you look at the above output, almost every storage-lite-pool- pod is on a different worker node. Two of them share
the same node, which is okay because we only have 3 nodes across 3 availability zones (AZs).
Basically there are 3 nodes in 3 AZs and 4 MinIO pods with 2 PVCs each, which is reflected in the status 8
Online below.
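
Before checking the logs, you could also confirm those 8 volumes directly (each should show as Bound once the tenant is up):

# Lists the PersistentVolumeClaims backing the 4 MinIO pods (2 per pod)
$ kubectl -n tenant-lite get pvc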

$ kubectl -n tenant-lite logs storage-lite-pool-0-0

…TRUNCATED…

Status: 8 Online, 0 Offline.


API: https://minio.tenant-lite.svc.cluster.local
Console: https://10.0.88.36:9443 https://127.0.0.1:9443

Documentation: https://min.io/docs/minio/linux/index.html

You will need the TCP port of the MinIO console; in this case it is 9443.

$ kubectl -n tenant-lite get svc | grep -i console


storage-lite-console ClusterIP 172.20.26.209 <none> 9443/TCP 6s

With this information, we can set up Kubernetes port forwarding. We chose port 39443 for the host, but this
could be anything, just be sure to use this same port when accessing the console through a web browser.

$ kubectl -n tenant-lite port-forward svc/storage-lite-console 39443:9443

Forwarding from 127.0.0.1:39443 -> 9443

Forwarding from [::1]:39443 -> 9443

Access the MinIO Console through the web browser using the following credentials:

URL: https://localhost:39443

User: minio

Password: minio123

You now have a fully production setup of a distributed MinIO cluster. Here is how you can automate it using
Jenkins:

Here is the Jenkins job's execute shell command in text format:

export PATH=$PATH:/usr/local/bin
cd ci-cd-deploy/terraform/aws/hello_world/
terraform init
terraform plan
terraform apply -auto-approve
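
A similar job (sketched here, not part of the original pipeline) could tear the environment down again when you are done testing:

export PATH=$PATH:/usr/local/bin
cd ci-cd-deploy/terraform/aws/hello_world/
terraform destroy -auto-approve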

Final Thoughts

In these past few blogs of the CI/CD series we’ve shown you how nimble and flexible MinIO is. You can build it
into anything you want using Packer and deploy it in VMs or Kubernetes clusters wherever it is needed. This
allows your developers to have as close to a production infrastructure as possible in their development
environment, while at the same time leveraging powerful security features such as Server Side Object
Encryption and managing IAM policies for restricting access to buckets.

In a production environment, you might want to restrict the IAM user to a specific policy but that really
depends on your use cases. For demonstration purposes, we kept things simple with a broad policy, but in
production you would want to narrow it down to specific resources and groups of users. In a later blog we’ll
show some of the best practices on how to design your infrastructure for different AZs and regions.

Would you like to try automating the kubectl part as well with Jenkins instead of applying manually? Let us
know what type of pipeline you’ve built using our tutorials for planning, deploying, scaling and securing MinIO
across the multicloud, and reach out to us on our Slack and share your pipelines!



MinIO as Helm Chart Repository
AJ 16 October 2023

If you are part of a team running infrastructure, whether as DevOps, SRE or Systems Engineer, it’s paramount
to keep tech debt to a minimum. In this case you want to ensure the supporting
systems in your infrastructure - databases, cache systems, messaging queues, log aggregators,
monitoring systems, application performance monitoring systems and I’m sure a few more I’m missing here - do
not add to the overall complexity of managing the infrastructure.

As an engineer you always want to ensure the infrastructure you are setting up is something that can be
used by multiple teams for multiple applications and projects. If you are picking a database, do your due
diligence and pick one that can serve most, if not all, of your applications. Having too many disparate
database servers running alongside each other adds to the complexity of installation, updates,
maintenance, backups, testing those backups and monitoring, among other things that take up a
lot of resources.

The same is true of storage systems. One of the things you want to make sure of is that the storage systems for
backups and DR are not used only for those purposes. Meaning, you should have
storage systems that support object storage, DB external table storage, metrics and log storage,
configuration management data, AI/ML data in data lakes and big data such as Spark, among countless other
use cases. This way you have a storage infrastructure that is not only resilient but also scalable, because
more teams and applications depend on storage than ever. While the
overhead of managing myriad pieces of infrastructure is now minimized, it becomes critical to ensure the
underlying infrastructure powering them is also scalable, reliable and performant. MinIO fits the bill for many
of these use cases because of its industry-leading performance and scalability. MinIO is capable
of tremendous performance – we’ve benchmarked it at 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177
GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs – and is used to build data lakes/lakehouses,
analytics and AI/ML workloads. With MinIO playing a critical role in storage infrastructure, it's important to
collect, monitor and analyze performance and usage metrics.

MinIO is also a major proponent of Kubernetes because of its open source nature and the vast resources
available for deploying applications. Since MinIO can be deployed anywhere, it was a no-brainer
that we wanted more ways to deploy MinIO via containers. From the get-go MinIO supported the
Docker way of deployment, which is the easiest way to get up and running. You might also know we maintain
and distribute our own Helm Chart to deploy the MinIO Operator and other components. This helps us deploy
MinIO alongside any Kubernetes application, whether it uses Helm or not.



But did you ever wonder how those charts actually get deployed? Today we’ll show you how your
MinIO cluster (on-prem, in the cloud, or on Kubernetes) can be used as a Helm Chart repository.

Install MinIO

Before we set up and configure the Helm repository please ensure you have a working MinIO cluster which
you are already using for existing data. If you just want to test this out first please follow the instructions
below to spin up a MinIO container quickly.

We’ll bring up a MinIO node with 4 disks. MinIO runs anywhere - physical, virtual or containers - and in this
overview, we will use containers created using Docker.

For the 4 disks, create directories on the host for minio:

mkdir -p /home/aj/minio/disk-1
mkdir -p /home/aj/minio/disk-2
mkdir -p /home/aj/minio/disk-3
mkdir -p /home/aj/minio/disk-4

Launch the Docker container with the following specifications for the MinIO node:

docker run -d \
-p 20091:9001 \
-v /home/aj/minio/disk-1:/mnt/disk1 \
-v /home/aj/minio/disk-2:/mnt/disk2 \
-v /home/aj/minio/disk-3:/mnt/disk3 \
-v /home/aj/minio/disk-4:/mnt/disk4 \
--name minio \
--hostname minio \
quay.io/minio/minio server http://minio/mnt/disk{1...4}/minio --console-address ":9001"

The above will launch a MinIO service in Docker with the console port listening on 20091 on the host. It will
also mount the local directories we created as volumes in the container and this is where MinIO will store its
data. You can access your MinIO service via http://localhost:20091.

Status: 4 Online, 0 Offline.


API: http://172.20.0.2:9000 http://127.0.0.1:9000
Console: http://172.20.0.2:9001 http://127.0.0.1:9001

Documentation: https://docs.min.io



If you see 4 Online that means you’ve successfully set up the MinIO node with 4 drives.

Go to the browser to load the MinIO console using http://localhost:20091, log in using minioadmin and
minioadmin for username and password respectively. Click on the Create Bucket button and create
testbucket123.

Installing Helm

Another similarity Helm shares with MinIO is its single binary. You don’t need to install multiple files or
dependencies in different locations and deal with path issues. You simply run the commands below and you’ll be
ready to go

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

$ chmod 700 get_helm.sh

$ ./get_helm.sh

Run Helm repo update to make sure all is working well

helm repo update



Configuring MinIO with Helm

In order to set up Helm as a repository in MinIO, we’ll need to create a separate bucket. You can name it
helm-repo or use the testbucket123 we created earlier. Ensure this bucket is public and accessible over
http/https without any API or libraries.
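
One way to make the bucket publicly readable (assuming you have already created an mc alias called myminio pointing at this deployment, as used in the mc cp commands below):

# Allow anonymous downloads so Helm can fetch index.yaml and chart tarballs over plain HTTP(S)
mc anonymous set download myminio/testbucket123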

In this bucket we’ll need to create an index.yaml. Think of it as the index.html of a website - the first page
the browser looks for when accessing a site. Similarly, when the helm command runs, it looks for
index.yaml at the root of the MinIO bucket, which lists the packages available in that
bucket for Helm to use. Think of it as a combination of index.html and a site map, so
you always know exactly where to go. This is exactly how the Helm package manager uses
the index.yaml in the MinIO bucket.

Now let's go ahead and set up the repo.

Create an index.yaml with the following contents

apiVersion: v1
entries:
  minio:
  - apiVersion: v1
    created: 2017-06-15T17:48:36.895822482Z
    description: Distributed object storage server built for cloud applications and devops.
    digest: 75ff1e3d779d8937cff57c28a102da97a520245d50e22c1a2763cbea064a76cd
    home: https://minio.io
    icon: https://www.minio.io/img/logo_160x160.png
    keywords:
    - storage
    - object-storage
    - S3
    maintainers:
    - email: hello@acale.ph
      name: Acaleph
    - email: hello@minio.io
      name: Minio
    name: minio
    sources:
    - https://github.com/minio/minio
    urls:
    - https://play.minio.io:9000/minio-helm/minio-0.1.2.tgz
    version: 0.1.2

Upload the index.yaml to the root of the bucket

mc cp ./index.yaml myminio/testbucket123

Upload the chart’s tarball to the bucket root as well

mc cp minio-0.1.2.tgz myminio/testbucket123
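If you do not already have a chart tarball and an index.yaml, Helm itself can generate both. A sketch, assuming your chart source lives in a local ./minio directory (a hypothetical path), with the URL matching the bucket endpoint we add below:

# Package the chart source into a versioned tarball, e.g. minio-0.1.2.tgz
helm package ./minio

# Generate (or refresh) index.yaml for every .tgz in the current directory,
# pointing the entries at the bucket's public endpoint
helm repo index . --url http://localhost:20090/testbucket123

Then upload the generated files with mc cp exactly as shown above.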

Add the MinIO bucket’s public HTTP endpoint as a Helm repository:



helm repo add myrepo http://localhost:20090/testbucket123

The moment of truth! Try installing the Helm package from the MinIO bucket you added as a Helm repo:

helm install my-minio myrepo/minio

Single Pane of Glass

This term is generally used when describing monitoring, but it can also describe the state of your
infrastructure. By adding yet another useful way to use your MinIO clusters, you have automatically reduced
the tech debt of your team and the larger organization. With MinIO you can also leverage our SUBNET
support portal, where the engineers who write the MinIO core codebase are available to help
you tackle important cluster tasks such as architecting the initial setup, developing a DR plan
and scaling the cluster as the need for more data, applications and use cases increases.

For more information, ask our experts using the live chat at the bottom right of the blog to learn more about
the SUBNET experience or email us at hello@min.io.



Building an ML Data Pipeline with MinIO and
Kubeflow v2.0
Keith Pijanowski 25 May 2023

Kubeflow Pipelines (KFP) is the most popular feature of Kubeflow. A Python engineer can turn a function
written in plain old Python into a component that runs in Kubernetes using the KFP decorators. If you used
KFP v1, be warned - the programming model in KFP v2 is very different - however, it is a big improvement.
Transforming plain old Python into reusable components and orchestrating these components into pipelines is
a lot easier.

In this post I want to go beyond the obligatory “Hello World” demo and present something that I hope you will
find either directly usable or at the very least a framework for plugging in your own logic.

What I will do is show how to build a KFP Pipeline that downloads US Census Bureau Data (which is a public
data set that is free to access) and saves this data to MinIO. MinIO is a great way to store your ML data and
models. Using MinIO, you can save training sets, validation sets, test sets, and models without worrying about
scale or performance. Also, someday AI will be regulated; when this day comes, you will need MinIO’s
enterprise features (object locking, versioning, encryption, and legal locks) to secure your data at rest and to
make sure you do not accidentally delete something that a regulatory agency may request.

You can learn more about the data we will be using here. To get an API key for the Census API, go to the
Census Bureau’s site for developers. This is very simple. All you need to do is specify an email address.

What We Will Build

In this post, I will build a pipeline that takes a table code (an identifier within the Census Bureau’s dataset) and
a year as parameters. It will then download the table via an API, if we have not previously downloaded it.

We will only call the Census API if we have not previously downloaded the table. When we call the ACS API,
we will save the data in an instance of MinIO that we set up for storing raw data. This is different from the
MinIO instance KFP uses internally. We could have tried to use KFP’s instance of MinIO - however, this is not
the best design for an ML Data Pipeline. You will want a storage solution that is totally under your control for
the reasons I described earlier. Below is a diagram of our Kubeflow and MinIO deployments that illustrates the
purpose of each instance of MinIO.



Before we start writing code, let's create a logical design of our pipeline.

Logical Pipeline Design

The pipelines you run in KFP are known as Directed Acyclic Graphs (DAG). They move in one direction and do
not backtrack - no closed loops. This is what you would expect of a data pipeline. Below is the logical design
of the DAG we will build and run in KFP. It is self-explanatory. Starting with a conceptual workflow is a good
way to help you transform your logic into functions that will leverage KFP to the fullest.

Now that we have a logical design, let’s start coding. I am going to assume you have KFP installed and that
you also have set up your own instance of MinIO. If you do not have KFP 2.0 and MinIO installed, check out
Setting up a Development Machine with Kubeflow Pipeline 2.0 and MinIO.

Creating Python Functions from a Logical Design

Each task in the logical design above is going to become a Python function. The function signatures below
show how the parameters and return values would be designed if we were writing a Python script or standalone
service without KFP. I want to discuss this in case you are migrating existing code to KFP.

def survey_data_exists(survey_code: str, year: int) -> bool:
    '''Check MinIO to see if the survey data exists.'''
    pass

def download_survey_data(table_code: str, year: int) -> pd.DataFrame:
    '''Download the survey data using the CB API and return a Pandas dataframe.'''
    pass

def save_survey_data(bucket: str, object_name: str, survey_df: pd.DataFrame) -> None:
    '''Save the survey data, which is a Pandas dataframe, to the MinIO bucket.'''
    pass

def get_survey_data(bucket: str, object_name: str) -> pd.DataFrame:
    '''Retrieve previously saved survey data from MinIO as a Pandas dataframe.'''
    pass

A few comments about the functions above. They use type hints. If you are writing plain old Python, you can
opt out of type hints because they are optional. In Kubeflow Pipelines, they are not - you must use type hints
so that KFP can tell you if your parameters and return values do not match when assembling functions into a
pipeline. This is a good thing. KFP will find type mismatch errors when you compile your pipeline. These same
errors would be very hard to track down at runtime within a cluster.

It may be tempting to combine functions so that you have fewer functions to manage. For example, the last
three functions could be combined into one by using a simple “if else” statement and then the first function
would not be needed. This is not a best practice when using a tool like KFP. As we will see, KFP has
constructs for conditions and loops. By using KFP’s constructs you will get better visualizations of your
pipeline in the KFP UI. Parallelism is also possible, which will improve pipeline performance. Finally, if we
keep our functions simple, we will get better reuse.

We are now ready to create Kubeflow Pipeline components using our Python functions.

Creating KFP Components from Python Functions

The code below is the complete implementation of our Pipeline components. When you use tools like KFP and
MinIO, you really do not have a lot of plumbing code to write.

@dsl.component(packages_to_install=['minio==7.1.14'])
def table_data_exists(bucket: str, table_code: str, year: int) -> bool:
    '''
    Check for the existence of Census table data in MinIO.
    '''
    from minio import Minio
    from minio.error import S3Error
    import logging

    object_name = f'{table_code}-{year}.csv'

    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)
    logger.info(bucket)
    logger.info(table_code)
    logger.info(year)
    logger.info(object_name)

    found = False
    try:
        # Create client with access and secret key.
        client = Minio('host.docker.internal:9000',
                       'Access key here.',
                       'Secret key here.',
                       secure=False)

        bucket_found = client.bucket_exists(bucket)
        if not bucket_found:
            return False

        objects = client.list_objects(bucket)
        for obj in objects:
            logger.info(obj.object_name)
            if object_name == obj.object_name:
                found = True

    except S3Error as s3_err:
        logger.error(f'S3 Error occurred: {s3_err}.')
    except Exception as err:
        logger.error(f'Error occurred: {err}.')

    return found

@dsl.component(packages_to_install=['pandas==1.3.5', 'requests'])
def download_table_data(dataset: str, table_code: str, year: int, table_df: Output[Dataset]):
    '''
    Returns all fields for the specified table. The output is a DataFrame saved to csv.
    '''
    import logging
    import pandas as pd
    import requests

    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)

    census_endpoint = f'https://api.census.gov/data/{year}/{dataset}'
    census_key = 'Census key here.'

    # Setup a simple dictionary for the requests parameters.
    get_token = f'group({table_code})'
    params = {'key': census_key,
              'get': get_token,
              'for': 'county:*'
              }

    # Sending get request and saving the response as response object.
    response = requests.get(url=census_endpoint, params=params)

    # Extract the data in json format.
    # The first row of our matrix contains the column names. The remaining rows
    # are the data.
    survey_data = response.json()
    df = pd.DataFrame(survey_data[1:], columns=survey_data[0])
    df.to_csv(table_df.path, index=False)
    logger.info(f'Table {table_code} for {year} has been downloaded.')

@dsl.component(packages_to_install=['pandas==1.3.5', 'minio==7.1.14'])
def save_table_data(bucket: str, table_code: str, year: int, table_df: Input[Dataset]):
    import io
    import logging
    from minio import Minio
    from minio.error import S3Error
    import pandas as pd

    object_name = f'{table_code}-{year}.csv'

    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)
    logger.info(bucket)
    logger.info(table_code)
    logger.info(year)
    logger.info(object_name)

    df = pd.read_csv(table_df.path)

    try:
        # Create client with access and secret key.
        client = Minio('host.docker.internal:9000',
                       'Access key here.',
                       'Secret key here.',
                       secure=False)

        # Make the bucket if it does not exist.
        found = client.bucket_exists(bucket)
        if not found:
            logger.info(f'Creating bucket: {bucket}.')
            client.make_bucket(bucket)

        # Upload the dataframe as an object.
        encoded_df = df.to_csv(index=False).encode('utf-8')
        client.put_object(bucket, object_name, data=io.BytesIO(encoded_df), length=len(encoded_df),
                          content_type='application/csv')
        logger.info(f'{object_name} successfully uploaded to bucket {bucket}.')
        logger.info(f'Object length: {len(df)}.')

    except S3Error as s3_err:
        logger.error(f'S3 Error occurred: {s3_err}.')
    except Exception as err:
        logger.error(f'Error occurred: {err}.')

@dsl.component(packages_to_install=['pandas==1.3.5', 'minio==7.1.14'])
def get_table_data(bucket: str, table_code: str, year: int, table_df: Output[Dataset]):
    import io
    import logging
    from minio import Minio
    from minio.error import S3Error
    import pandas as pd

    object_name = f'{table_code}-{year}.csv'

    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)
    logger.info(bucket)
    logger.info(table_code)
    logger.info(year)
    logger.info(object_name)

    # Get data of an object.
    response = None
    try:
        # Create client with access and secret key.
        client = Minio('host.docker.internal:9000',
                       'Access key here.',
                       'Secret key here.',
                       secure=False)

        response = client.get_object(bucket, object_name)
        df = pd.read_csv(io.BytesIO(response.data))
        df.to_csv(table_df.path, index=False)
        logger.info(f'Object: {object_name} has been retrieved from bucket: {bucket} in MinIO object storage.')
        logger.info(f'Object length: {len(df)}.')

    except S3Error as s3_err:
        logger.error(f'S3 Error occurred: {s3_err}.')
    except Exception as err:
        logger.error(f'Error occurred: {err}.')

    finally:
        if response:
            response.close()
            response.release_conn()

The most important fact to keep in mind as you implement and troubleshoot these functions is that at runtime
they are not functions at all. They will be components. In other words, KFP will take each function and deploy
it to its own container. This sample uses Lightweight Python Components. You can also use containerized
Python components which give you more control over what is put into the container. There is also a
containerized components option for non-Python code.

KFP introduces several constructs to help you seamlessly create functions that can behave as standalone
components running in a container. They are the component decorator, parameters, and artifacts. Let’s walk
through these tools so that you understand how KFP deploys functions and passes data between them at run
time.

Components

The component decorator tells KFP that a function should be deployed as a component. Carefully look at how
this decorator is used in the code above. Since the function will be deployed separately to a container, you
need to tell KFP its dependencies. This is done using the packages_to_install parameter of the decorator. This
only ensures that dependencies are installed (via pip). It does not import them for you. You need to do this
yourself within the function definition. This may look a little unorthodox, as most of us are used to importing
dependencies at the module level - but it is fine when using a tool like KFP that turns functions into services.
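As a minimal illustration of this pattern (the component below is hypothetical and not part of the pipeline in this post), the dependency is declared on the decorator while the import happens inside the function body:

from kfp import dsl

@dsl.component(packages_to_install=['pandas==1.3.5'])
def row_count(csv_text: str) -> int:
    # packages_to_install pip-installs pandas into this component's container,
    # but the import still has to happen inside the function.
    import io
    import pandas as pd

    df = pd.read_csv(io.StringIO(csv_text))
    return len(df)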

Passing data between components must be done with care. KFP v2 makes the distinction between
parameters and artifacts. Parameters are for simple data that is passed between function calls (int, bool, str,
float, list, dict). Artifacts, on the other hand, represent data that your functions retrieve from an external



source or create - such as datasets, models, and metrics that depict the accuracy of your model. You can
even use artifacts to create HTML and Markdown if you want to style your output so that it is more
presentable in the Kubeflow UI. Since artifacts can be large, KFP uses its own instance of MinIO to store them.

Parameters (and return values)

KFP makes use of Python type hints for specifying simple input parameters and simple return values. You are
limited to using str, int, float, bool, list, and dict. The table_data_exists function above shows how parameters
are specified in a function signature. Syntactically, you specify these the same way you would with standard
Python. Remember using type hints is a requirement. At runtime, KFP takes care of marshaling these values
between components - which are running in different containers.

If a function requires a more complicated data type as an input or if it returns a complicated data type then
use artifacts.

Artifacts

Artifacts are different from input parameters and output values in that they may get large. Examples of an
artifact are: a dataset, a model, metrics (the results of ML training efforts), HTML, and Markdown. Under the
hood, KFP uses its own instance of MinIO to store artifacts. When you pass an artifact from one component to
another KFP does not pass the artifact directly - rather it stores the artifact in MinIO and passes a reference
to the artifact (object) in MinIO. This is really clever. It means that if you have a large artifact that needs to be
accessed by several components then the artifact can be efficiently accessed by these components - since
MinIO is purpose built for efficient object storage and access.

Let’s look at what happens when you pass an artifact to a component. In the code sample above,
save_table_data shows how this is done. Before your function is invoked, KFP copies the artifact from its
instance of MinIO to the local file system of the container your component is running in. Your code will need to
read this file. This is done using the path attribute of the parameter you declared to be of type Input[Dataset].
In the save_table_data function, I read this file into a Pandas DataFrame.

Output artifacts are specified as function parameters and cannot be the return value of a function. In the code
above, get_table_data shows how to use output artifacts. Notice that the table_df parameter has a datatype
of Output[Dataset]. To successfully return data from a function, you must write the data to the location
specified in the parameter’s path attribute. Again, this is a reference to the local file system in your container
- KFP will take care of moving this file to its instance of MinIO when your function completes.
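Putting input and output artifacts together, a hypothetical component that reads one dataset artifact and writes another might look like this sketch:

from kfp import dsl
from kfp.dsl import Input, Output, Dataset

@dsl.component(packages_to_install=['pandas==1.3.5'])
def drop_empty_rows(in_df: Input[Dataset], out_df: Output[Dataset]):
    import pandas as pd

    # KFP has already copied the input artifact from its MinIO instance
    # to the local file referenced by in_df.path.
    df = pd.read_csv(in_df.path)

    # Write the result to out_df.path; KFP moves this file back to its
    # MinIO instance when the component finishes.
    df.dropna().to_csv(out_df.path, index=False)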

We are now ready to assemble our components into a pipeline.

Creating Pipelines from Components

The code below creates our pipeline (or DAG) from the components we implemented in the previous section.



@dsl.pipeline(
    name='census-pipeline',
    description='Pipeline that will download Census data and save to MinIO.'
)
def census_pipeline(bucket: str, dataset: str, table_code: str, year: int) -> Dataset:
    # Positional arguments are not allowed.
    # When I set the name parameter of the condition that task in the DAG fails.

    exists = table_data_exists(bucket=bucket, table_code=table_code, year=year)

    with dsl.Condition(exists.output == False):
        table_data = download_table_data(dataset=dataset, table_code=table_code, year=year)
        save_table_data(bucket=bucket,
                        table_code=table_code,
                        year=year,
                        table_df=table_data.outputs['table_df'])

    with dsl.Condition(exists.output == True):
        table_data = get_table_data(bucket=bucket,
                                    table_code=table_code,
                                    year=year)

    return table_data.outputs['table_df']

There are a few things worth noting in this function. First, the pipeline decorator is telling KFP that this
function contains our pipeline definition. The name and description you specify here will show up in the KFP
UI.

Next, the return value of this pipeline function is a Dataset. It turns out that pipelines can be used just like
components. When a pipeline has a return value then it can be used within another pipeline. This is a great
way to reuse components.

Finally, we are using dsl.Condition (which is a Python context manager) to only
call our download component if the data we need is not already in our instance of MinIO. We could have used
a conventional if statement here. However, if we did then KFP would not have any way of knowing that we
have a branch in our logic. By using the dsl.Condition construct we are telling KFP about a branch in our
pipeline. This will allow the KFP UI to give us a better visual representation.
Running a Pipeline

Once you have your components and your pipeline implemented you are two lines of code away from running
your pipeline.



client = Client()

run = client.create_run_from_pipeline_func(
    census_pipeline,
    experiment_name='Implementing functions',
    enable_caching=False,
    arguments={
        'bucket': 'census-data',
        'table_code': 'B01001',
        'year': 2020
    }
)

Choose a meaningful experiment name. The KFP UI has an experiments tab that will group runs with the same
experiment name. The code above “compiles” your pipeline and components - which is merely the act of
putting everything into a YAML file (including your source code). If you have any type mismatches that I
described earlier, then you will find out about these problems while creating the run. This code will also send
your pipeline to KFP and run it. Below is a screenshot showing a few successful runs of our pipeline.
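If you want the compiled YAML itself - for source control, or to upload through the KFP UI - you can also invoke the compiler explicitly rather than submitting the run directly; a minimal sketch:

from kfp import compiler

# Writes the pipeline and its components (including their source) to a YAML file.
compiler.Compiler().compile(
    pipeline_func=census_pipeline,
    package_path='census_pipeline.yaml'
)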

Summary

In this post we created a data pipeline that uses KFP and MinIO to download and save US Census data. To do
this we set up our own instance of MinIO for storing raw data. This is an important piece of an ML pipeline -
someday AI will be regulated and having a storage solution under your control allows you to version, lock, and
encrypt data used for training and the models themselves.

We also discussed how KFP uses its own instance of MinIO to efficiently save and access artifacts during
pipeline runs.

In my next post, I will show how this data pipeline can be used as input to another pipeline that uses Census
data to train a model. If you have questions, drop us a line at hello@min.io or join the discussion on our
general Slack channel.



How to Set up Kafka and Stream Data to MinIO in
Kubernetes
Dileeshvar Radhakrishnan, AJ 24 April 2023

Apache Kafka is an open-source distributed event streaming platform that is used for building real-time data
pipelines and streaming applications. It was originally developed by LinkedIn and is now maintained by the
Apache Software Foundation. Kafka is designed to handle high volume, high throughput, and low latency data
streams, making it a popular choice for building scalable and reliable data streaming solutions.

Some of the benefits of Kafka include:

■ Scale and Speed: Handling large-scale data streams and millions of events per second, and scaling
horizontally by adding more Kafka brokers to the cluster

■ Fault Tolerance: Replicating data across multiple brokers in a Kafka cluster ensures that data is
highly available and can be recovered in case of failure, making Kafka a reliable choice for critical
data streaming applications
■ Versatility: Support for a variety of data sources and data sinks making it highly versatile. It can be
used for building a wide range of applications, such as real-time data processing, data ingestion, data
streaming, and event-driven architectures
■ Durability: All published messages are stored for a configurable amount of time, allowing consumers
to read data at their own pace. This makes Kafka suitable for use cases where data needs to be retained for
historical analysis or replayed for recovery purposes.

Please see Apache Kafka for more information.

Deploying Kafka on Kubernetes, a widely-used container orchestration platform, offers several additional
advantages. Kubernetes enables dynamic scaling of Kafka clusters based on demand, allowing for efficient
resource utilization and automatic scaling of Kafka brokers to handle changing data stream volumes. This
ensures that Kafka can handle varying workloads without unnecessary resource wastage or performance
degradation.

Running Kafka clusters as containers provides easy deployment, management, and monitoring, and makes
them highly portable across different environments. This allows for seamless migration of Kafka clusters
across various cloud providers, data centers, or development environments.



Kubernetes includes built-in features for handling failures and ensuring high availability of Kafka clusters. For
example, it automatically reschedules failed Kafka broker containers and supports rolling updates without
downtime, ensuring continuous availability of Kafka for data streaming applications, thereby enhancing the
reliability and fault tolerance of Kafka deployments.

Kafka and MinIO are commonly used to build data streaming solutions. MinIO is a high-performance,
distributed object storage system designed to support cloud-native applications with S3-compatible storage
for unstructured, semi-structured and structured data. When used as a data sink with Kafka, MinIO enables
organizations to store and process large volumes of data in real-time.

Some benefits of combining Kafka with MinIO include:

■ High Performance: MinIO writes Kafka streams as fast as they come in. A recent benchmark
achieved 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of
off-the-shelf NVMe SSDs.

■ Scalability: MinIO handles large amounts of data and scales horizontally across multiple nodes,
making it a perfect fit for storing data streams generated by Kafka. This allows organizations to store
and process massive amounts of data in real-time, making it suitable for big data and high-velocity
data streaming use cases.
■ Durability: MinIO provides durable storage, allowing organizations to retain data for long periods of
time, such as for historical analysis, compliance requirements, or data recovery purposes.
■ Fault Tolerance: MinIO erasure codes data across multiple nodes, providing fault tolerance and
ensuring data durability. This complements Kafka's fault tolerance capabilities, making the overall
solution highly available, reliable and resilient.
■ Easy Integration: MinIO is easily integrated with Kafka using Kafka Connect, a built-in framework for
connecting Kafka with external systems. This makes it straightforward to stream data from Kafka to
MinIO for storage, and vice versa for data retrieval, enabling seamless data flow between Kafka and
MinIO. We’ll see how straightforward this is in the tutorial below.

In this post, we will walk through how to set up Kafka on Kubernetes using Strimzi, an open-source project
that provides operators to run Apache Kafka and Apache ZooKeeper clusters on Kubernetes, including
distributions such as OpenShift. Then we will use Kafka Connect to stream data to MinIO.

Prerequisites

Before we start, ensure that you have the following:

■ A running Kubernetes cluster


■ kubectl command-line tool
■ A running MinIO cluster
■ mc command line tool for MinIO
■ Helm package manager



Install Strimzi Operator

The first step is to install the Strimzi operator on your Kubernetes cluster. The Strimzi operator manages the
lifecycle of Kafka and ZooKeeper clusters on Kubernetes.

Add the Strimzi Helm chart repository

!helm repo add strimzi https://strimzi.io/charts/

"strimzi" already exists with the same configuration, skipping

Install the chart with release name my-release:

!helm install my-release strimzi/strimzi-kafka-operator --namespace=kafka --create-namespace

NAME: my-release
LAST DEPLOYED: Mon Apr 10 20:03:12 2023
NAMESPACE: kafka
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing strimzi-kafka-operator-0.34.0

To create a Kafka cluster refer to the following documentation.

https://strimzi.io/docs/operators/latest/deploying.html#deploying-cluster-operator-helm-chart-str

This installs the latest version (0.34.0 at the time of this writing) of the operator in the newly created kafka
namespace. For additional configurations refer to this page.

Create Kafka Cluster

Now that we have installed the Strimzi operator, we can create a Kafka cluster. In this example, we will create
a Kafka cluster with three Kafka brokers and three ZooKeeper nodes.

Let's create a YAML file as shown here:

%%writefile deployment/kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.4.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.4"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}

Overwriting deployment/kafka-cluster.yaml



Let's create the cluster by deploying the YAML file. We’re deploying a cluster, so it will take some time before
it is up and running.

!kubectl apply -f deployment/kafka-cluster.yaml

kafka.kafka.strimzi.io/my-kafka-cluster created

Check the status of the cluster with

!kubectl -n kafka get kafka my-kafka-cluster

NAME               DESIRED KAFKA REPLICAS   DESIRED ZK REPLICAS   READY   WARNINGS
my-kafka-cluster   3                        3                     True
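If you prefer not to poll, you can block until the cluster reports Ready; a sketch using kubectl wait against Strimzi's Kafka custom resource:

!kubectl -n kafka wait kafka/my-kafka-cluster --for=condition=Ready --timeout=300s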

Now that we have the cluster up and running, let’s produce and consume sample topic events, starting with
the kafka topic my-topic.

Create Kafka Topic

Create a YAML file for the kafka topic my-topic as shown below and apply it.

%%writefile deployment/kafka-my-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 3
  replicas: 3

Overwriting deployment/kafka-my-topic.yaml

!kubectl apply -f deployment/kafka-my-topic.yaml



kafkatopic.kafka.strimzi.io/my-topic created

Check the status of the topic with

!kubectl -n kafka get kafkatopic my-topic

NAME       CLUSTER            PARTITIONS   REPLICATION FACTOR   READY
my-topic   my-kafka-cluster   3            3                    True

Produce and Consume Messages

With the Kafka cluster and topic set up, we can now produce and consume messages.

To create a Kafka producer pod that produces messages to the my-topic topic, run the command below in a
terminal:

kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.34.0-kafka-3.4.0 --rm=true \
  --restart=Never -- bin/kafka-console-producer.sh --broker-list my-kafka-cluster-kafka-bootstrap:9092 \
  --topic my-topic

This will give us a prompt to send messages to the producer. In parallel, we can bring up the consumer to
start consuming the messages that we sent to the producer:

kubectl -n kafka run kafka-consumer -ti --image=quay.io/strimzi/kafka:0.34.0-kafka-3.4.0 --rm=true \
  --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 \
  --topic my-topic --from-beginning

The consumer will replay all the messages that we sent to the producer earlier and, if we add any new
messages to the producer, they will also start showing up at the consumer side.

You can delete the my-topic topic with

!kubectl -n kafka delete kafkatopic my-topic

kafkatopic.kafka.strimzi.io "my-topic" deleted

Now that the Kafka cluster is up and running with a dummy topic producer/consumer, we can start consuming
topics directly into MinIO using the Kafka Connector.



Set Up Kafka Connector with MinIO

Next we will use the Kafka Connector to stream topics directly to MinIO. First let's look at what connectors are
and how to set one up. Here is a high-level overview of how the different Kafka Connect components interact.

Kafka Connectors

Kafka Connect is an integration toolkit for streaming data between Kafka brokers and other systems. The
other system is typically an external data source or target, such as MinIO.

Kafka Connect utilizes a plugin architecture to provide implementation artifacts for connectors, which are
used for connecting to external systems and manipulating data. Plugins consist of connectors, data
converters, and transforms. Connectors are designed to work with specific external systems and define a
schema for their configuration. When configuring Kafka Connect, you configure the connector instance, and
the connector instance then defines a set of tasks for data movement between systems.

In the distributed mode of operation, Strimzi operates Kafka Connect by distributing data streaming tasks
across one or more worker pods. A Kafka Connect cluster consists of a group of worker pods, with each
connector instantiated on a single worker. Each connector can have one or more tasks that are distributed
across the group of workers, enabling highly scalable data pipelines.

Workers in Kafka Connect are responsible for converting data from one format to another, making it suitable
for the source or target system. Depending on the configuration of the connector instance, workers may also
apply transforms, also known as Single Message Transforms (SMTs), which can adjust messages, such as by
filtering certain data, before they are converted. Kafka Connect comes with some built-in transforms, but
additional transformations can be provided by plugins as needed.
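As an illustration, a built-in transform such as InsertField can stamp each record with the topic it came from before it is written out. The fragment below is hypothetical and not used in the tutorial that follows; in a Strimzi KafkaConnector these keys would sit under the connector's spec.config:

transforms: addTopic
transforms.addTopic.type: org.apache.kafka.connect.transforms.InsertField$Value
transforms.addTopic.topic.field: origin_topic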

Kafka Connect uses the following components while streaming data

■ Connectors - create tasks


■ Tasks - move data
■ Workers - run tasks
■ Transformers - manipulate data



■ Converters - convert data

There are two types of connectors:

1. Source Connectors - push data into Kafka

2. Sink Connectors - extract data from Kafka to an external system like MinIO

Let's configure a Sink Connector that extracts data from Kafka and stores it in MinIO, as shown below.

The Sink Connector streams data from Kafka and goes through the following steps:

1. A plugin provides the implementation artifacts for the Sink Connector: In Kafka Connect, a Sink
Connector is used to stream data from Kafka to an external system. The implementation artifacts for
the Sink Connector, such as the code and configuration, are provided by a plugin. Plugins are used to
extend the functionality of Kafka Connect and enable connections to different external data systems.

2. A single worker initiates the Sink Connector instance: In a distributed mode of operation, Kafka
Connect runs as a cluster of worker pods. Each worker pod can initiate a Sink Connector instance,
which is responsible for streaming data from Kafka to the external data system. The worker manages
the lifecycle of the Sink Connector instance, including its initialization and configuration.
3. The Sink Connector creates tasks to stream data: Once the Sink Connector instance is initiated, it
creates one or more tasks to stream data from Kafka to the external data system. Each task is
responsible for processing a portion of the data and can run in parallel with other tasks for efficient
data processing.



4. Tasks run in parallel to poll Kafka and return records: The tasks retrieve records from Kafka topics
and prepare them for forwarding to the external data system. The parallel processing of tasks
enables high throughput and efficient data streaming.
5. Converters put the records into a format suitable for the external data system: Before forwarding the
records to the external data system, converters are used to put the records into a format that is
suitable for the specific requirements of the external data system. Converters handle data format
conversion, such as from Kafka's binary format to a format supported by the external data system.
6. Transforms adjust the records, such as filtering or relabeling them: Depending on the configuration of
the Sink Connector, transformations, Single Message Transforms (SMTs), can be applied to adjust the
records before they are forwarded to the external data system. Transformations can be used for
tasks such as filtering, relabeling, or enriching the data to be sent to the external system.
7. The sink connector is managed using KafkaConnectors or the Kafka Connect API: The Sink
Connector, along with its tasks, is managed using KafkaConnectors, or through the Kafka Connect
API, which provides programmatic access for managing Kafka Connect. This allows for easy
configuration, monitoring, and management of Sink Connectors and their tasks in a Kafka Connect
deployment.

Setup

We will create a simple example that performs the following steps:

1. Create a Producer that will stream data from MinIO and produce events for a topic in JSON format

2. Build a Kafka Connect image that has S3 dependencies

3. Deploy Kafka Connect based on the above image

4. Deploy a Kafka Sink Connector that consumes the Kafka topic and stores the data in a MinIO bucket

Getting Demo Data into MinIO

We will be using the NYC Taxi dataset that is available on MinIO. If you don't have the dataset, follow the
instructions here.

Producer

Below is simple Python code that reads data from MinIO and produces events for the topic my-topic:

%%writefile sample-code/producer/src/producer.py
import logging
import os

import fsspec
import pandas as pd
import s3fs
from kafka import KafkaProducer

logging.basicConfig(level=logging.INFO)

producer = KafkaProducer(bootstrap_servers="my-kafka-cluster-kafka-bootstrap:9092")

fsspec.config.conf = {
    "s3":
        {
            "key": os.getenv("AWS_ACCESS_KEY_ID", "openlakeuser"),
            "secret": os.getenv("AWS_SECRET_ACCESS_KEY", "openlakeuser"),
            "client_kwargs": {
                "endpoint_url": "https://play.min.io:50000"
            }
        }
}
s3 = s3fs.S3FileSystem()
total_processed = 0
i = 1
for df in pd.read_csv('s3a://openlake/spark/sample-data/taxi-data.csv', chunksize=1000):
    count = 0
    for index, row in df.iterrows():
        producer.send("my-topic", bytes(row.to_json(), 'utf-8'))
        count += 1
    producer.flush()
    total_processed += count
    if total_processed % 10000 * i == 0:
        logging.info(f"total processed till now {total_processed}")
        i += 1

Overwriting sample-code/producer/src/producer.py

Add the requirements file and the Dockerfile that we will use to build the Docker image:



%%writefile sample-code/producer/requirements.txt
pandas==2.0.0
s3fs==2023.4.0
pyarrow==11.0.0
kafka-python==2.0.2

Overwriting sample-code/producer/requirements.txt

%%writefile sample-code/producer/Dockerfile
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1

COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY src/producer.py .
CMD ["python3", "-u", "./producer.py"]

Overwriting sample-code/producer/Dockerfile

Build and push the Docker image for the producer using the above Dockerfile, or use the one available in
openlake: openlake/kafka-demo-producer.

Let's create a YAML file that deploys our producer in the Kubernetes cluster as a job

%%writefile deployment/producer.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
  namespace: kafka
spec:
  template:
    metadata:
      name: producer-job
    spec:
      containers:
        - name: producer-job
          image: openlake/kafka-demo-producer:latest
      restartPolicy: Never

Writing deployment/producer.yaml

Deploy the producer.yaml file

!kubectl apply -f deployment/producer.yaml

job.batch/producer-job created

Check the logs by using the below command

!kubectl logs -f job.batch/producer-job -n kafka # stop this shell once you are done

<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)


<jemalloc>: (This is the expected behaviour if you are running under QEMU)
INFO:kafka.conn:<BrokerConnection node_id=bootstrap-0 host=my-kafka-cluster-kafka-bootstrap:9092
<connecting> [IPv4 ('10.96.4.95', 9092)]>: connecting to my-kafka-cluster-kafka-bootstrap:9092
[('10.96.4.95', 9092) IPv4]
INFO:kafka.conn:Probing node bootstrap-0 broker version
INFO:kafka.conn:<BrokerConnection node_id=bootstrap-0 host=my-kafka-cluster-kafka-bootstrap:9092
<connecting> [IPv4 ('10.96.4.95', 9092)]>: Connection complete.
INFO:kafka.conn:Broker version identified as 2.5.0
INFO:kafka.conn:Set configuration api_version=(2, 5, 0) to skip auto check_version requests on startup
INFO:kafka.conn:<BrokerConnection node_id=0
host=my-kafka-cluster-kafka-0.my-kafka-cluster-kafka-brokers.kafka.svc:9092 <connecting> [IPv4
('10.244.1.4', 9092)]>: connecting to
my-kafka-cluster-kafka-0.my-kafka-cluster-kafka-brokers.kafka.svc:9092 [('10.244.1.4', 9092) IPv4]
INFO:kafka.conn:<BrokerConnection node_id=0
host=my-kafka-cluster-kafka-0.my-kafka-cluster-kafka-brokers.kafka.svc:9092 <connecting> [IPv4
('10.244.1.4', 9092)]>: Connection complete.
INFO:kafka.conn:<BrokerConnection node_id=bootstrap-0 host=my-kafka-cluster-kafka-bootstrap:9092
<connected> [IPv4 ('10.96.4.95', 9092)]>: Closing connection.
INFO:root:total processed till now 10000
rpc error: code = NotFound desc = an error occurred when try to find container
"85acfb121b7b63bf0f46d9ef89aed9b05666b3fb86b4a835e9d2ebf67c6943f9": not found



Now that we have our basic producer sending JSON events to my-topic, let’s deploy Kafka Connect and the
corresponding Connector that stores these events in MinIO.

Build Kafka Connect Image

Let's build a Kafka Connect image that has S3 dependencies

%%writefile sample-code/connect/Dockerfile
FROM confluentinc/cp-kafka-connect:7.0.9 as cp
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.4.2
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-avro-converter:7.3.3

FROM quay.io/strimzi/kafka:0.34.0-kafka-3.4.0
USER root:root
# Add S3 dependency
COPY --from=cp /usr/share/confluent-hub-components/confluentinc-kafka-connect-s3/ /opt/kafka/plugins/kafka-connect-s3/

Overwriting sample-code/connect/Dockerfile

Build and push the Docker image using the above Dockerfile, or use the one available in
openlake: openlake/kafka-connect:0.34.0.

Before we deploy Kafka Connect, we need to create the storage topics that Kafka Connect requires, if they are
not already present.

Create Storage Topics

Let's create the connect-status, connect-configs and connect-offsets topics and deploy them as shown below:

%%writefile deployment/connect-status-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-status
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
  config:
    cleanup.policy: compact

Writing deployment/connect-status-topic.yaml

%%writefile deployment/connect-configs-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-configs
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
  config:
    cleanup.policy: compact

Writing deployment/connect-configs-topic.yaml

%%writefile deployment/connect-offsets-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-offsets
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
  config:
    cleanup.policy: compact



Writing deployment/connect-offsets-topic.yaml

Deploy the above topics:

!kubectl apply -f deployment/connect-status-topic.yaml
!kubectl apply -f deployment/connect-configs-topic.yaml
!kubectl apply -f deployment/connect-offsets-topic.yaml

Deploy Kafka Connect

Next, create a YAML file for Kafka Connect that uses the above image and deploys it in Kubernetes. Kafka
Connect will have 1 replica and make use of the storage topics we created above.

NOTE: spec.template.connectContainer.env has the credentials defined in order for Kafka Connect to store
data in the MinIO cluster. Other details, like the endpoint URL and bucket name, will be part of the KafkaConnector.

%%writefile deployment/connect.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: connect-cluster
  namespace: kafka
  annotations:
    strimzi.io/use-connector-resources: "true"
spec:
  image: openlake/kafka-connect:0.34.0
  version: 3.4.0
  replicas: 1
  bootstrapServers: my-kafka-cluster-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: my-kafka-cluster-cluster-ca-cert
        certificate: ca.crt
  config:
    bootstrap.servers: my-kafka-cluster-kafka-bootstrap:9092
    group.id: connect-cluster
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    internal.key.converter: org.apache.kafka.connect.json.JsonConverter
    internal.value.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: false
    value.converter.schemas.enable: false
    offset.storage.topic: connect-offsets
    offset.storage.replication.factor: 1
    config.storage.topic: connect-configs
    config.storage.replication.factor: 1
    status.storage.topic: connect-status
    status.storage.replication.factor: 1
    offset.flush.interval.ms: 10000
    plugin.path: /opt/kafka/plugins
    offset.storage.file.filename: /tmp/connect.offsets
  template:
    connectContainer:
      env:
        - name: AWS_ACCESS_KEY_ID
          value: "openlakeuser"
        - name: AWS_SECRET_ACCESS_KEY
          value: "openlakeuser"

Writing deployment/connect.yaml

!kubectl apply -f deployment/connect.yaml

kafkaconnect.kafka.strimzi.io/connect-cluster created

Deploy Kafka Sink Connector

Now that we have Kafka Connect up and running, the next step is to deploy the Sink Connector that will poll
my-topic and store data into the MinIO bucket openlake-tmp.

connector.class - specifies what type of connector the Sink Connector will use, in our case it is
io.confluent.connect.s3.S3SinkConnector

store.url - MinIO endpoint URL where you want to store the data from Kafka Connect

storage.class - specifies which storage class to use, in our case we are storing in MinIO so
io.confluent.connect.s3.storage.S3Storage will be used



format.class - Format type to store data in MinIO, since we would like to store JSON we will use
io.confluent.connect.s3.format.json.JsonFormat

%%writefile deployment/connector.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: "minio-connector"
  namespace: "kafka"
  labels:
    strimzi.io/cluster: connect-cluster
spec:
  class: io.confluent.connect.s3.S3SinkConnector
  config:
    connector.class: io.confluent.connect.s3.S3SinkConnector
    tasks.max: '1'
    topics: my-topic
    s3.region: us-east-1
    s3.bucket.name: openlake-tmp
    s3.part.size: '5242880'
    flush.size: '1000'
    store.url: https://play.min.io:50000
    storage.class: io.confluent.connect.s3.storage.S3Storage
    format.class: io.confluent.connect.s3.format.json.JsonFormat
    partitioner.class: io.confluent.connect.storage.partitioner.DefaultPartitioner
    behavior.on.null.values: ignore

Overwriting deployment/connector.yaml

!kubectl apply -f deployment/connector.yaml

kafkaconnector.kafka.strimzi.io/minio-connector created

We can see files being added to the MinIO openlake-tmp bucket with



!mc ls --summarize --recursive play/openlake-tmp/topics/my-topic

[2023-04-11 19:53:29 PDT] 368KiB STANDARD partition=0/my-topic+0+0000000000.json
[2023-04-11 19:53:30 PDT] 368KiB STANDARD partition=0/my-topic+0+0000001000.json

[...TRUNCATED…]

[2023-04-11 19:54:07 PDT] 368KiB STANDARD partition=0/my-topic+0+0000112000.json
[2023-04-11 19:54:08 PDT] 368KiB STANDARD partition=0/my-topic+0+0000113000.json
[2023-04-11 19:54:08 PDT] 368KiB STANDARD partition=0/my-topic+0+0000114000.json

Total Size: 41 MiB
Total Objects: 115

We created an end-to-end implementation that produces events to a Kafka topic and consumes them directly into MinIO
using Kafka Connect. This is a great start to learning how to use MinIO and Kafka together to build a
streaming data repository. But wait, there’s more.

In my next post, I explain and show you how to take this tutorial and turn it into something that is a lot more
efficient and performant.

Achieve Streaming Success with Kafka and MinIO

This blog post showed you how to get started building a streaming data lake. Of course, there are many more
steps involved between this beginning and production.

MinIO is cloud-native object storage that forms the foundation for ML/AI, analytics, streaming video, and
other demanding workloads running in Kubernetes. MinIO scales seamlessly, ensuring that you can simply
expand storage to accommodate a growing data lake.

Customers frequently build data lakes using MinIO and expose them to a variety of cloud-native applications
for business intelligence, dashboarding and other analysis. They build them using Apache Iceberg, Apache
Hudi and Delta Lake. They use Snowflake, SQL Server, or a variety of databases to read data saved in MinIO
as external tables. And they use Dremio, Apache Druid and Clickhouse for analytics, and Kubeflow and
Tensorflow for ML.

MinIO can even replicate data between clouds to leverage specific applications and frameworks, while it is
protected using access control, version control, encryption and erasure coding.

Don’t take our word for it though — build it yourself. You can download MinIO and you can join our Slack
channel.



Dremio and MinIO on Kubernetes for Fast Scalable
Analytics
Dileeshvar Radhakrishnan 13 April 2023

Cloud native object stores such as MinIO are frequently used to build data lakes that house large structured,
semi-structured and unstructured data in a central repository. Data lakes usually contain raw data obtained
from multiple sources, including streaming and ETL. Organizations analyze this data to spot trends and
measure the health of the business.

What is Dremio?

Dremio is an open-source, distributed analytics engine that provides a simple, self-service interface for data
exploration, transformation, and collaboration. Dremio's architecture is built on top of Apache Arrow, a
high-performance columnar memory format, and leverages the Parquet file format for efficient storage. For
more on Dremio, please see Getting Started with Dremio.

MinIO for Cloud-Native Data Lakes

MinIO is a high-performance, distributed object storage system designed for cloud-native applications. The
combination of scalability and high-performance puts every workload, no matter how demanding, within
reach. A recent benchmark achieved 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with
just 32 nodes of off-the-shelf NVMe SSDs.

MinIO is built to power data lakes and the analytics and AI that runs on top of them. MinIO includes a number
of optimizations for working with large datasets consisting of many small files, a common occurrence with any
of today’s open table formats.

Perhaps more importantly for data lakes, MinIO guarantees durability and immutability. In addition, MinIO
encrypts data in transit and on drives, and regulates access to data using IAM and policy based access
controls (PBAC).



Set up Dremio OSS in Kubernetes

We can use Helm charts to deploy Dremio in a Kubernetes cluster. In this scenario, we will use the Dremio
OSS (Open Source Software) image to deploy one Master, three Executors and three ZooKeepers. The Master
node coordinates the cluster and the Executors process data. By deploying multiple Executors, we can
parallelize data processing and improve cluster performance.

We’ll use a MinIO bucket to store the data. New files uploaded to Dremio are stored in the MinIO bucket. This
enables us to store and process large amounts of data in a scalable and distributed manner.

Prerequisites

To follow these instructions, you will need

■ A Kubernetes cluster. You can use Minikube or Kind to set up a local Kubernetes cluster.

■ Helm, the package manager for Kubernetes. You can follow this guide to install Helm on your
machine.
■ A MinIO server running on bare metal or Kubernetes, or you can use our Play server for testing
purposes.
■ A MinIO client (mc) to access the MinIO server. You can follow this guide to install mc on your
machine.

Clone minio/openlake repo

MinIO engineers put together the openlake repository to give you the tools to build open source data lakes.
The overall goal of this repository is to guide you through the steps needed to build a data lake using open
source tools like Apache Spark, Apache Kafka, Trino, Apache Iceberg, Apache Airflow, and other tools
deployed on Kubernetes with MinIO as the object store.

!git clone https://github.com/minio/openlake

Create MinIO bucket

Let's create a MinIO bucket openlake/dremio which will be used by Dremio as the distributed storage

!mc mb play/openlake
!mc mb play/openlake/dremio



Clone dremio-cloud-tools repo

We will use the helm charts from the Dremio repo to set it up

!git clone https://github.com/dremio/dremio-cloud-tools

We will use the dremio_v2 version of the charts, and we will use the values.minio.yaml file in the Dremio
directory of the openlake repository to set up Dremio. Let’s copy the YAML to
dremio-cloud-tools/charts/dremio_v2 and then confirm that it has been copied

!cp ~/openlake/dremio/charts/values.minio.yaml ~/dremio-cloud-tools/charts/dremio_v2/


!ls ~/dremio-cloud-tools/charts/dremio_v2/

Deployment Details

If we take a deep dive into the values.minio.yaml file (feel free to cat or open the file in your editor of
choice), we’ll gain a greater understanding of our deployment and learn about some of the modifications
made to the distStorage section

distStorage:
  aws:
    bucketName: "openlake"
    path: "/dremio"
    authentication: "accessKeySecret"
    credentials:
      accessKey: "minioadmin"
      secret: "minioadmin"
    extraProperties: |
      <property>
        <name>fs.s3a.endpoint</name>
        <value>play.min.io</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
      <property>
        <name>dremio.s3.compat</name>
        <value>true</value>
      </property>

We set the distStorage to aws, the bucket name is openlake, and all of Dremio's storage will live
under the prefix dremio (i.e., s3://openlake/dremio). We also add extraProperties because we
are specifying the MinIO endpoint, plus two additional properties to make Dremio
work with MinIO: fs.s3a.path.style.access needs to be set to true, and dremio.s3.compat must be set
to true so that Dremio knows this is an S3-compatible object store.

Apart from this, we can customize multiple other configurations, like executor CPU and memory usage,
depending on the Kubernetes cluster capacity. We can also specify how many executors we need depending
on the size of the workloads Dremio is going to handle, as sketched below.
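For example, the executor section of values.minio.yaml can be tuned along these lines; the numbers are illustrative and assume the standard dremio_v2 chart layout:

executor:
  count: 3        # number of executor pods
  cpu: 4          # CPU cores requested per executor
  memory: 8192    # memory per executor in MB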

Install Dremio using Helm

Make sure to update your MinIO endpoint, access key and secret key in values.minio.yaml. The commands
below will install the Dremio release named dremio in the newly created namespace dremio.

!cd ~/dremio-cloud-tools/charts && helm install dremio dremio_v2 -f dremio_v2/values.minio.yaml --namespace dremio --create-namespace

Give Helm a few minutes to work its magic, then verify that Dremio was installed and is running

!kubectl -n dremio get pods # after the helm setup is complete it takes some time for the pods to be up and running
!kubectl -n dremio get svc # List all the services in namespace dremio
!mc ls play/openlake/dremio # we should see new prefixes being created that Dremio will use later

Log in to Dremio

To log in to Dremio, let’s open a port-forward for the dremio-client service to our localhost. After executing
the below command, point your browser at http://localhost:9047. For security purposes, please remember to
close the port-forward after you are finished exploring Dremio.

!kubectl -n dremio port-forward svc/dremio-client 9047

You will need to create a new user when you first launch Dremio



Once we have created the user, we will be greeted with a welcome page. To keep this workflow simple, let's
upload a sample dataset to Dremio that is included in the openlake repo data/nyc_taxi_small.csv and
start querying it.

We can upload openlake/dremio/data/nyc_taxi_small.csv by clicking on the + at the top right corner
of the home page, as shown below.

Dremio will automatically parse the CSV and provide the recommended formatting as shown below, click Save
to proceed.



To verify that the CSV file was uploaded to the MinIO bucket:

!mc ls --summarize --recursive play/openlake/dremio/uploads # you will see the CSV file uploaded into the MinIO bucket

After loading the file, we will be taken to the SQL Query Console where we can start executing queries. Here
are two sample queries that you can try executing:

SELECT count(*) FROM nyc_taxi_small;

SELECT * FROM nyc_taxi_small;

Paste the above in the console and click Run, and you will see something like the below.



You can click on the Query1 tab to see the number of rows in the dataset, and the Query2 tab to see the rows themselves.



Data Lakes and Dremio

This blog post walked you through deploying Dremio in a Kubernetes cluster and using MinIO as the
distributed storage. We also saw how to upload a sample dataset to Dremio and start querying it. We have
just touched the tip of the iceberg in this post to help you get started building your data lake. 😜

Speaking of icebergs, Apache Iceberg is an open table format that was built for object storage. Many a data
lake has been built using the combination of Dremio, Spark, Iceberg, and MinIO. To learn more, please see The
Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.

Try Dremio on MinIO today. If you have any questions or want to share tips, please reach out through our
Slack channel or drop us a note on hello@min.io.



Simplifying Multi-Cloud Kubernetes with MinIO and
Rafay
Matt Sarrel, AJ 22 March 2023

Enterprises are deploying multi-cloud services on a scale we’ve never seen before. Kubernetes is a key
enabler of multi-cloud success because it establishes a common, declarative software-based platform that
provides a consistent API-driven experience regardless of underlying hardware and software. However, it can
be time consuming and error prone to manage a multitude of Kubernetes clusters and their applications and
data across the multi-cloud.

It’s no secret that managing Kubernetes manually requires considerable skill to scale effectively. Challenges
grow as you scale because you’re supporting more and bigger Kubernetes clusters. At some point,
Kubernetes’ complexity may even threaten your ability to adapt legacy software to the cloud-native age.
Adding external storage to the mix compounds those challenges, especially when you have to deal with
variations in hardware and inconsistent APIs. If you are not architected for the multi-cloud, you run the risk of
failing in the multi-cloud.

We’ve joined forces with Rafay to develop this tutorial to show you how to make the most of multi-cloud
Kubernetes using Rafay to deploy, update and manage Kubernetes and applications using MinIO for object
storage. Rafay is a SaaS-based Kubernetes operations solution that standardizes, configures, monitors,
automates and manages a set of Kubernetes clusters through a single interface. MinIO is the fastest
software-defined, Kubernetes native, object store. It includes replication, integrations, automations and runs
anywhere Kubernetes does – public/private cloud, edge, developer laptops and more.

MinIO brings S3 API functionality and object storage to Kubernetes, providing a consistent interface anywhere
you run Kubernetes. DevOps and platform teams use the MinIO Operator and kubectl plugin to deploy and
manage object storage across the multi-cloud. Cloud-native MinIO integrates with external identity
management, encryption key management, load balancing, certificate management and monitoring and
alerting applications and services – it simply works with whatever you're already using in your organization.
MinIO is frequently used to build data lakes/lakehouses, at the edge and to deliver Object Storage as a
Service in the datacenter.

MinIO and Rafay are both known for their combination of power and simplicity. Follow the tutorial below to
begin exploring how they can standardize and automate operations for your Kubernetes clusters and manage
their applications and data.



Rafay Install

We need a Kubernetes cluster to get started on our endeavor. Managed EKS or GKE clusters would work, and
on-prem bare metal Kubernetes clusters would work as well. Our ethos has always been simplicity: anyone
can get started with just their laptop and grow production systems from there. We'll use our laptops for this
tutorial in order to demonstrate the simplicity of Rafay and MinIO.

Download MicroK8s

Let’s start with by installing MicroK8s using brew

% brew install ubuntu/microk8s/microk8s

....

==> microk8s
Run `microk8s install` to start with MicroK8s

Install MicroK8s

% microk8s install

% microk8s kubectl get namespace


NAME STATUS AGE
kube-system Active 92s
kube-public Active 92s
kube-node-lease Active 92s
default Active 90s

Add a shortcut alias in bash so you do not have to repeat the entire command every time

% vim ~/.bash_profile

alias mk8s="microk8s kubectl"

Check to see the alias is working



% mk8s get ns
NAME STATUS AGE
kube-system Active 5m34s
kube-public Active 5m34s
kube-node-lease Active 5m34s
default Active 5m32s

Great, if that is working, let's move on to enabling some essential add-ons required for the operation of our
cluster.

Enable DNS, StorageClass and RBAC

In order for the pods in the MicroK8s cluster to talk internally and to route external DNS requests, let's enable
DNS, which is managed by CoreDNS. In order to have a persistent volume for our MinIO installation, we'll
enable MicroK8s hostpath storage. Last but not least, we also need RBAC to securely control access for
Calico routing and for the internal user-based kubectl access configured through the Rafay console.

Enable DNS, hostpath storage and RBAC MicroK8s add-ons.

% microk8s enable dns


Infer repository core for addon dns
Enabling DNS
No valid resolv.conf file could be found
Falling back to 8.8.8.8 8.8.4.4 as upstream nameservers
Applying manifest
serviceaccount/coredns created
configmap/coredns created
deployment.apps/coredns created
service/kube-dns created
clusterrole.rbac.authorization.k8s.io/coredns created
clusterrolebinding.rbac.authorization.k8s.io/coredns created
Restarting kubelet
DNS is enabled

% microk8s enable hostpath-storage


Infer repository core for addon hostpath-storage
Enabling default storage class.
WARNING: Hostpath storage is not suitable for production environments.

deployment.apps/hostpath-provisioner created
storageclass.storage.k8s.io/microk8s-hostpath created
serviceaccount/microk8s-hostpath created



clusterrole.rbac.authorization.k8s.io/microk8s-hostpath created
clusterrolebinding.rbac.authorization.k8s.io/microk8s-hostpath created
Storage will be available soon.

% microk8s enable rbac


Infer repository core for addon rbac
Enabling RBAC
Reconfiguring apiserver
RBAC is enabled

Verify DNS is enabled

% mk8s get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-869878fccf-84l9q   1/1     Running   0          15m
kube-system   calico-node-x4xsj                          1/1     Running   0          15m
kube-system   coredns-6f5f9b5d74-p4skc

MinIO Cluster

There are a couple of ways to get our MinIO Kubernetes cluster connected to Rafay. We can either go to the
Rafay console and launch a new cluster on AWS, GCP, Azure or even bare metal, or import an already running
Kubernetes cluster into the Rafay console. In this case, we already have a running MicroK8s Kubernetes
cluster, so we'll go ahead and import that.

Follow steps 1 and 2 on this page to import the MicroK8s cluster we set up locally. Once you are on step 3,
you’ll get a bootstrap yaml file which you need to apply to the Microk8s cluster.

Bootstrap Cluster

Install Rafay operator and bootstrap the cluster

% mk8s apply -f ~/Downloads/mk8sdesktop-bootstrap.yaml


namespace/rafay-system created
serviceaccount/system-sa created
clusterrole.rbac.authorization.k8s.io/rafay:manager created
clusterrolebinding.rbac.authorization.k8s.io/rafay:rafay-system:manager-rolebinding created
clusterrole.rbac.authorization.k8s.io/rafay:proxy-role created
clusterrolebinding.rbac.authorization.k8s.io/rafay:rafay-system:proxy-rolebinding created



priorityclass.scheduling.k8s.io/rafay-cluster-critical-v3 created
priorityclass.scheduling.k8s.io/rafay-cluster-critical created
role.rbac.authorization.k8s.io/rafay:leader-election-role created
rolebinding.rbac.authorization.k8s.io/rafay:leader-election-rolebinding created
customresourcedefinition.apiextensions.k8s.io/namespaces.cluster.rafay.dev created
customresourcedefinition.apiextensions.k8s.io/tasklets.cluster.rafay.dev created
customresourcedefinition.apiextensions.k8s.io/tasks.cluster.rafay.dev created
service/controller-manager-metrics-service-v4 created
deployment.apps/controller-manager-v3 created
configmap/connector-config-v3 created
configmap/proxy-config-v3 created
deployment.apps/rafay-connector-v3 created
service/rafay-drift-v3 created
validatingwebhookconfiguration.admissionregistration.k8s.io/rafay-drift-validate-v3 created

Once you apply the bootstrap file, it will take about 5 minutes for all the pods to come up

% mk8s get po -n rafay-system


NAME                                     READY   STATUS    RESTARTS   AGE
relay-agent-75bb76cc64-wxjmh             1/1     Running   0          3m
rafay-connector-v3-c965fc7cf-pjx9x       1/1     Running   0          96s
controller-manager-v3-58cf8f6445-mv55l   1/1     Running   0          95s
edge-client-767b87fb5-44fpn              1/1     Running   0          70s

In the reachability check you should see SUCCESS and the control plane should look HEALTHY.

Deploy MinIO

There are several ways to deploy MinIO: using the Go binary with a systemd service file, in Kubernetes with the
MinIO Operator, or using a Helm chart. We'll use a Helm chart in this example to show the workflow in the
Rafay console for importing a Helm chart.

In order to get started, add the MinIO Helm chart repository



% helm repo add minio https://helm.min.io/

Download the MinIO tar.gz Helm chart package, which we will later upload to the Rafay console.

% helm fetch minio/minio

Create the following minio-custom-values.yaml file to upload later in Rafay console

## Enable persistence using Persistent Volume Claims
##
persistence:
  # Specify the size for MinIO Storage
  size: 50Gi

## Configure resource requests and limits for your MinIO container
##
resources:
  requests:
    memory: 2Gi
    cpu: 500m
  limits:
    memory: 4Gi
    cpu: 1

## Enable and configure ingress to expose MinIO service externally
##
ingress:
  enabled: false
  annotations:
    # Add annotation to use built-in nginx ingress controller
    kubernetes.io/ingress.class: nginx
    # Add annotation to use cert-manager for generating and maintaining the cert for MinIO ingress
    cert-manager.io/cluster-issuer: "letsencrypt-http"
  path: /
  hosts:
    # Change the host to your domain
    - minio.ajtest.local
  tls:
    - secretName: minio-ingress-tls
      hosts:
        - minio.ajtest.local

## Change below settings if you would like to use K8S secrets for MinIO's access and secret key
## Remove this if you are planning to use the Vault integration
##
existingSecret: ""
accessKey: "minioadmin"
secretKey: "minioadmin"



Create the MinIO Namespace using the Rafay interface

Verify it’s been created in your cluster

% mk8s get ns
NAME STATUS AGE
default Active 2d
kube-system Active 2d
kube-public Active 2d
kube-node-lease Active 2d
rafay-system Active 2d
minio Active 4s

We have all the prerequisites now: the Helm chart tar.gz, the Helm values YAML file and a namespace to deploy to
on the cluster. Next, create a new workload, name it “minio” and set the package type to “Helm 3”. This
tells the Rafay console to use the Helm prerequisite files we created earlier. Select “Upload files manually” to
upload the Helm chart tar.gz and the Helm values YAML file.

Select the MinIO package and values yaml file and Publish the workload



Give it a few minutes and then verify the workload is ready

Click on Debug -> Pods to see the MinIO pod running
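
You can also confirm from the command line that the Helm release landed in the minio namespace. A quick sketch using the mk8s alias we set up earlier (the exact pod and service names come from the chart and may differ):

% mk8s -n minio get pods
% mk8s -n minio get svc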

MinIO Console

Next, expose the MinIO Console, the browser-based GUI for managing MinIO, using Kubernetes port
forwarding.



Open a port-forward and go to http://localhost:9000

% mk8s -n minio port-forward service/minio 9000:9000 --address 0.0.0.0


Forwarding from 0.0.0.0:9000 -> 9000
Handling connection for 9000
Handling connection for 9000
Handling connection for 9000

Log in to the MinIO Console using the credentials that were set in the Helm chart’s values file.

On the bottom right click on the + to create a bucket

Let’s name this “testbucket”



Add a test object to the bucket we just created by clicking the Upload icon on the bottom right of the screen.
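
If you prefer the command line over the Console, the same steps can be done with the MinIO Client (mc). This is a minimal sketch that assumes the port-forward above is still running and reuses the credentials from the values file; the alias name local and the test file hello.txt are placeholders (skip mc mb if you already created testbucket in the Console):

% mc alias set local http://localhost:9000 minioadmin minioadmin
% mc mb local/testbucket
% mc cp ./hello.txt local/testbucket/
% mc ls local/testbucket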

Seamless, Simple and Streamlined for Multi-Cloud Kubernetes



At MinIO, we always strive to make our software as seamless and straightforward as possible. It starts with
detailed, easy-to-read documentation and single-command deployment. You get software-defined object
storage that runs anywhere from a developer’s laptop to production Kubernetes or bare metal clusters,
combined with the simplicity of the browser-based MinIO Console user interface. A commercial subscription
adds access to the MinIO Subscription Network and ties it all together with real-time collaboration with our
engineers on our revolutionary SUBNET portal. This tutorial showed you how to work with MinIO object storage
and Rafay Systems’ management console for Kubernetes to set up Kubernetes workloads on a MicroK8s
cluster. Once the necessary operators are installed, you will be able to see the status of your locally running
MicroK8s cluster in the Rafay console.

This short tutorial can be run on a laptop to demonstrate how quick and easy it is to get started with MinIO
and Rafay. Once you’ve completed this tutorial, you’ll see how simple it is to manage your MinIO object
storage deployments with Rafay Systems. You can focus on running your MinIO clusters in multiple locations
and connect them all back to be managed and monitored by Rafay Systems in a single pane of glass view.

With the average enterprise running hundreds of Kubernetes clusters across more than two locations, the
combination of MinIO and Rafay Systems gives teams the autonomy and reliability they need, laying the
groundwork for successfully deploying and maintaining applications across your entire multi-cloud presence.



Spark, MinIO and Kubernetes
Dileeshvar Radhakrishnan 6 March 2023

Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It
is designed to handle large-scale data processing with speed, efficiency and ease of use. Spark provides a
unified analytics engine for large-scale data processing, with support for multiple languages, including Java,
Scala, Python, and R.

The benefits of using Spark are numerous. First, it provides a high level of parallelism, which means that it can
process large amounts of data quickly and efficiently across multiple nodes in a cluster. Second, Spark
provides a rich set of APIs for data processing, including support for SQL queries, machine learning, graph
processing, and stream processing. Third, Spark has a flexible and extensible architecture that allows
developers to easily integrate with various data sources and other tools.

When running Spark jobs, it is crucial to use a suitable storage system to store the input and output data.
Object storage systems like MinIO are the only way to run Spark jobs against petabytes of data as they are
highly scalable and durable storage solutions. MinIO is an open-source object storage system that can be
easily deployed on-premises or in the cloud of your choice. With industry leading S3-compatibility, MinIO is
used with a wide range of tools that support the S3 API, including Spark.

Using MinIO with Spark provides several benefits over traditional Hadoop Distributed File System (HDFS) or
other file-based storage systems. MinIO is highly scalable and can handle large amounts of data, as in
petabytes, with ease. Capable of over 2.6Tbps for READS and 1.32Tbps for WRITES, MinIO provides the
performance-at-scale that is needed to support large Spark datasets. MinIO is a flexible and cost-effective
storage solution that can be easily integrated with other tools and systems. Data written to MinIO is
immutable and versioned, as well as highly durable, with multiple copies of erasure coded data stored across
multiple nodes for redundancy and fault tolerance. Rounding out functionality, Active-Active replication and
Batch Replication can be used for further redundancy and fault tolerance, or simply to move data where it can
best be used.

Why Spark on Kubernetes?

Deploying Apache Spark on Kubernetes offers several advantages over deploying it standalone. Here are
some reasons why:

1. Resource management: Kubernetes provides powerful resource management capabilities that can
help optimize resource utilization and minimize waste. By deploying Spark on Kubernetes, you can
take advantage of Kubernetes’ resource allocation and scheduling features to allocate resources to
Spark jobs dynamically, based on their needs.

2. Scalability: Kubernetes can automatically scale the resources allocated to Spark based on the
workload. This means that Spark can scale up or down depending on the amount of data it needs to
process, without the need for manual intervention.
3. Fault-tolerance: Kubernetes provides built-in fault tolerance mechanisms that ensure the reliability of
Spark clusters. If a node in the cluster fails, Kubernetes automatically reschedules the Spark tasks to
another node, ensuring that the workload is not impacted.
4. Simplified deployment: Kubernetes offers a simplified deployment model, where you can deploy
Spark using a single YAML file. This file specifies the resources required for the Spark cluster, and
Kubernetes automatically handles the rest.
5. Integration with other Kubernetes services: By deploying Spark on Kubernetes, you can take
advantage of other Kubernetes services, such as monitoring and logging, to gain greater visibility into
your Spark cluster's performance and health.

Set Up Spark on Kubernetes

We will use Spark Operator to set up Spark on Kubernetes. Spark Operator is a Kubernetes controller that
allows you to manage Spark applications on Kubernetes. It provides a custom resource definition (CRD) called
SparkApplication, which allows you to define and run Spark applications on Kubernetes. Spark Operator also
provides a web UI that allows you to easily monitor and manage Spark applications. Spark Operator is built on
top of the Kubernetes Operator SDK, which is a framework for building Kubernetes operators. Spark Operator
is open-source and available on GitHub. It is also available as a Helm chart, which makes it easy to deploy on
Kubernetes. In this tutorial, we will use the Helm chart to deploy Spark Operator on a Kubernetes cluster.

Spark Operator offers various features to simplify the management of Spark applications in Kubernetes
environments. These include declarative application specification and management using custom resources,
automatic submission of eligible SparkApplications, native cron support for scheduled applications, and
customization of Spark pods beyond native capabilities through the mutating admission webhook.

Additionally, the tool supports automatic re-submission and restart of updated SparkApplications, as well as
retries of failed submissions with linear back-off. It also provides functionality to mount local Hadoop
configuration as a Kubernetes ConfigMap and automatically stage local application dependencies to MinIO via
sparkctl. Finally, the tool supports the collection and export of application-level metrics and driver/executor
metrics to Prometheus.

Prerequisites

To follow this tutorial, you will need:

1. A Kubernetes cluster. You can use Minikube to set up a local Kubernetes cluster on your machine.

2. Helm, the package manager for Kubernetes. You can follow this guide to install Helm on your
machine.
3. A MinIO server running on bare metal or Kubernetes. You can follow this guide to install MinIO on bare
metal or this guide to install MinIO on Kubernetes or you can use the MinIO Play server for testing
purposes.



4. A MinIO client (mc) to access the MinIO server. You can follow this guide to install mc on your
machine.

Install Spark Operator

To install Spark Operator, you need to add the Helm repository for Spark Operator to your local Helm client.
You can do this by running the following command:

Unset
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

Once the repository is added, you can install Spark Operator using the following command (you may have to
wait a minute while it is installed):

Unset
helm install my-release spark-operator/spark-operator \
  --namespace spark-operator \
  --set webhook.enable=true \
  --set image.repository=openlake/spark-operator \
  --set image.tag=3.3.1 \
  --create-namespace

You will see the following output:

Unset
LAST DEPLOYED: Mon Feb 27 19:48:33 2023

NAMESPACE: spark-operator

STATUS: deployed

REVISION: 1

TEST SUITE: None



This command installs Spark Operator in the spark-operator namespace and enables the mutating
admission webhook. The webhook is required to enable the mounting of local Hadoop configuration as a
Kubernetes ConfigMap and to configure environment variables that the driver and executors can use. The
image repository and tag are set to an image that contains the latest version of Spark Operator. You can also
use the default image repository and tag by omitting the --set image.repository and --set image.tag flags;
at the time of this writing, the latest Spark Operator release used Spark 3.1.1, whereas
openlake/spark-operator used the latest 3.3.1 release of Spark. You can skip the --create-namespace
flag if you already have a namespace named spark-operator. The operator will also monitor Spark
applications in all namespaces.

A detailed list of configuration options can be found here.
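
If you would rather inspect the chart's options from the command line, you can also dump its default values; a quick sketch:

helm show values spark-operator/spark-operator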

Verify Spark Operator Installation

To verify that Spark Operator is installed successfully, you can run the following command:

Unset
kubectl get pods -n spark-operator

You will see a result similar to the following output:

Unset
NAME READY STATUS RESTARTS AGE

my-release-spark-operator-f56c4d8c4-pr857 1/1 Running 0 14m

Now that we have the Spark operator installed, we can deploy a Spark application or Scheduled Spark
application on Kubernetes.

Deploy a Spark Application

Let's try deploying one of the simple example Spark applications that come with the Spark Operator. You can
find the list of example applications here. We’re interested in calculating Pi, so we will modify the Spark Pi
application to use Spark 3.3.1 and run it on Kubernetes.

Unset
apiVersion: "sparkoperator.k8s.io/v1beta2"

kind: SparkApplication

A Collection of Posts on MinIO and Kubernetes | 93


metadata:

name: pyspark-pi

namespace: spark-operator

spec:

type: Python

pythonVersion: "3"

mode: cluster

image: "openlake/spark-py:3.3.1"

imagePullPolicy: Always

mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py

sparkVersion: "3.3.1"

restartPolicy:

type: OnFailure

onFailureRetries: 3

onFailureRetryInterval: 10

onSubmissionFailureRetries: 5

onSubmissionFailureRetryInterval: 20

driver:

cores: 1

coreLimit: "1200m"

memory: "512m"

labels:

version: 3.1.1

serviceAccount: my-release-spark

executor:

cores: 1

A Collection of Posts on MinIO and Kubernetes | 94


instances: 1

memory: "512m"

labels:

version: 3.3.1

The above application will calculate the value of Pi using Spark on Kubernetes. You can save the above
application as spark-pi.yaml and deploy it using the following command:

Unset
kubectl apply -f spark-pi.yaml

To verify that the job is running, you can run the following:

Unset
kubectl -n spark-operator get pods

And you should see something like this:

Unset
NAME                                           READY   STATUS      RESTARTS   AGE
my-release-spark-operator-59bccf4d94-fdrc9     1/1     Running     0          24d
my-release-spark-operator-webhook-init-jspnn   0/1     Completed   0          68d
pyspark-pi-driver                              1/1     Running     0          23s
pythonpi-b6a3e48693762e5d-exec-1               1/1     Running     0          7s

You can check the status of the application using the following command:



Unset
kubectl get sparkapplications -n spark-operator

You will see the following output:

Unset
NAME         STATUS      ATTEMPTS   START                  FINISH                 AGE
pyspark-pi   COMPLETED   1          2023-02-27T15:20:29Z   2023-02-27T15:20:59Z   10m

You can also check the logs of the application using the following command:

Unset
kubectl logs pyspark-pi-driver -n spark-operator

You will see the following output:

Unset
23/02/27 15:20:55 INFO DAGScheduler: Job 0 finished: reduce at /opt/spark/examples/src/main/python/pi.py:42, took 2.597098 s

Pi is roughly 3.137960

23/02/27 15:20:55 INFO SparkUI: Stopped Spark web UI at http://pyspark-pi-d73653869375fa87-driver-svc.spark-operator.svc:4040

23/02/27 15:20:55 INFO KubernetesClusterSchedulerBackend: Shutting down all executors

Now that we have the simple Spark application working as expected we can try to read and write data from
MinIO using Spark.

Read and Write Data from MinIO using Spark

Reading and writing data from and to MinIO using Spark is very simple once we have the right dependencies
and configuration in place. In this post we will not discuss the dependencies in detail; to keep things simple, we
use the openlake/spark-py:3.3.1 image, which contains all the dependencies required to read and write data
from MinIO using Spark.



Getting Demo Data into MinIO

We will be using the NYC Taxi dataset that is available on MinIO. You can download the dataset from here;
it has ~112M rows and is ~10GB in size. For this exercise, any existing or new MinIO deployment with
enough free space will do. You can also use any other dataset of your choice. Upload the data to MinIO using
the following commands; first we’ll create the bucket and prefixes that will be referenced by our applications:

Unset
mc mb <Your-MinIO-Endpoint>/openlake

mc mb <Your-MinIO-Endpoint>/openlake/spark

mc mb <Your-MinIO-Endpoint>/openlake/spark/sample-data

mc cp nyc-taxi-data.csv <Your-MinIO-Endpoint>/openlake/spark/sample-data/taxi-data.csv # uploaded as taxi-data.csv, the name the Spark application expects
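
Before submitting any jobs, you can confirm that the object landed under the name the application expects and check its size; a quick sketch:

mc stat <Your-MinIO-Endpoint>/openlake/spark/sample-data/taxi-data.csv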

Sample Python Application

Let's now read and write data from MinIO using Spark. We will use the following sample python application to
do that.

Unset
import logging
import os

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("MinioSparkJob")

spark = SparkSession.builder.getOrCreate()


def load_config(spark_context: SparkContext):
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key",
                                                 os.getenv("AWS_ACCESS_KEY_ID", "<Your-MinIO-AccessKey>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key",
                                                 os.getenv("AWS_SECRET_ACCESS_KEY", "<Your-MinIO-SecretKey>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.endpoint",
                                                 os.getenv("ENDPOINT", "<Your-MinIO-Endpoint>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.attempts.maximum", "1")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.establish.timeout", "5000")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.timeout", "10000")


load_config(spark.sparkContext)

# Define schema for NYC Taxi Data
schema = StructType([
    StructField('VendorID', LongType(), True),
    StructField('tpep_pickup_datetime', StringType(), True),
    StructField('tpep_dropoff_datetime', StringType(), True),
    StructField('passenger_count', DoubleType(), True),
    StructField('trip_distance', DoubleType(), True),
    StructField('RatecodeID', DoubleType(), True),
    StructField('store_and_fwd_flag', StringType(), True),
    StructField('PULocationID', LongType(), True),
    StructField('DOLocationID', LongType(), True),
    StructField('payment_type', LongType(), True),
    StructField('fare_amount', DoubleType(), True),
    StructField('extra', DoubleType(), True),
    StructField('mta_tax', DoubleType(), True),
    StructField('tip_amount', DoubleType(), True),
    StructField('tolls_amount', DoubleType(), True),
    StructField('improvement_surcharge', DoubleType(), True),
    StructField('total_amount', DoubleType(), True)])

# Read CSV file from MinIO
df = spark.read.option("header", "true").schema(schema).csv(
    os.getenv("INPUT_PATH", "s3a://openlake/spark/sample-data/taxi-data.csv"))

# Filter dataframe based on passenger_count greater than 6
large_passengers_df = df.filter(df.passenger_count > 6)

total_rows_count = df.count()
filtered_rows_count = large_passengers_df.count()

# File Output Committer is used to write the output to the destination (not recommended for production)
large_passengers_df.write.format("csv").option("header", "true").save(
    os.getenv("OUTPUT_PATH", "s3a://openlake-tmp/spark/nyc/taxis_small"))

logger.info(f"Total Rows for NYC Taxi Data: {total_rows_count}")
logger.info(f"Total Rows for Passenger Count > 6: {filtered_rows_count}")

The above application reads the NYC Taxi dataset from MinIO and filters the rows where the passenger count
is greater than 6. The filtered data is then written to MinIO. You can save the above code as main.py.
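
If you want to smoke-test main.py outside of Kubernetes first, you can run it with a local spark-submit. This is a minimal sketch, assuming Spark 3.3.1 is installed locally and your MinIO endpoint is reachable from your machine; the hadoop-aws coordinates match the Hadoop version bundled with Spark 3.3.1, and the OUTPUT_PATH here is just an arbitrary test prefix:

export AWS_ACCESS_KEY_ID=<Your-MinIO-AccessKey>
export AWS_SECRET_ACCESS_KEY=<Your-MinIO-SecretKey>
export ENDPOINT=<Your-MinIO-Endpoint>
export INPUT_PATH=s3a://openlake/spark/sample-data/taxi-data.csv
export OUTPUT_PATH=s3a://openlake/spark/output/taxi-data-output-local

spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.2 main.py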

Building the Docker Image

We will now build the Docker image that contains the above Python application. You can create a Dockerfile
with the following contents to build the image:



Unset
FROM openlake/spark-py:3.3.1

USER root

WORKDIR /app

RUN pip3 install pyspark==3.3.1

COPY src/*.py .

You can build your own Docker image or use the pre-built image openlake/sparkjob-demo:3.3.1 that is
available on Docker Hub. If you need a refresher on building Docker images, please see docker build.
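
If you do build the image yourself, the commands look roughly like the following; <your-registry> is a placeholder for your own registry, and main.py is expected under src/ to match the Dockerfile's COPY instruction:

docker build -t <your-registry>/sparkjob-demo:3.3.1 .
docker push <your-registry>/sparkjob-demo:3.3.1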

Deploying the MinIO Spark Application

To read and write data from MinIO using Spark, you need to create a secret that contains the MinIO access
key and secret key. You can create the secret using the following command:

Unset
kubectl create secret generic minio-secret \
  --from-literal=AWS_ACCESS_KEY_ID=<Your-MinIO-AccessKey> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<Your-MinIO-SecretKey> \
  --from-literal=ENDPOINT=<Your-MinIO-Endpoint> \
  --from-literal=AWS_REGION=us-east-1 \
  --namespace spark-operator

You will see the following output:

Unset
secret/minio-secret created
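
You can confirm the secret was created with the expected keys before referencing it from the SparkApplication; a quick sketch (describe lists key names and sizes, not the values):

kubectl -n spark-operator describe secret minio-secret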

Now that we have the secret created, we can deploy the Spark application that reads and writes data from
MinIO. You can save the following application as sparkjob-minio.yaml:



Unset
apiVersion: "sparkoperator.k8s.io/v1beta2"

kind: SparkApplication

metadata:

name: spark-minio

namespace: spark-operator

spec:

type: Python

pythonVersion: "3"

mode: cluster

image: "openlake/sparkjob-demo:3.3.1"

imagePullPolicy: Always

mainApplicationFile: local:///app/main.py

sparkVersion: "3.3.1"

restartPolicy:

type: OnFailure

onFailureRetries: 3

onFailureRetryInterval: 10

onSubmissionFailureRetries: 5

onSubmissionFailureRetryInterval: 20

driver:

cores: 1

memory: "1024m"

labels:

version: 3.3.1

serviceAccount: my-release-spark

env:

A Collection of Posts on MinIO and Kubernetes | 101


- name: AWS_REGION

value: us-east-1

- name: AWS_ACCESS_KEY_ID

value: <Your-MinIO-AccessKey>

- name: AWS_SECRET_ACCESS_KEY

value: <Your-MinIO-SecretKey>

executor:

cores: 1

instances: 3

memory: "1024m"

labels:

version: 3.3.1

env:

- name: INPUT_PATH

value: "s3a://openlake/spark/sample-data/taxi-data.csv"

- name: OUTPUT_PATH

value: "s3a://openlake/spark/output/taxi-data-output"

- name: AWS_REGION

valueFrom:

secretKeyRef:

name: minio-secret

key: AWS_REGION

- name: AWS_ACCESS_KEY_ID

valueFrom:

secretKeyRef:

name: minio-secret

A Collection of Posts on MinIO and Kubernetes | 102


key: AWS_ACCESS_KEY_ID

- name: AWS_SECRET_ACCESS_KEY

valueFrom:

secretKeyRef:

name: minio-secret

key: AWS_SECRET_ACCESS_KEY

- name: ENDPOINT

valueFrom:

secretKeyRef:

name: minio-secret

key: ENDPOINT

The above Python Spark Application YAML file contains the following configurations:

■ spec.type: The type of the application. In this case, it is a Python application.

■ spec.pythonVersion: The version of Python used in the application.


■ spec.mode: The mode of the application. In this case, it is a cluster mode application.
■ spec.image: The docker image that contains the application.
■ spec.imagePullPolicy: The image pull policy for the docker image.
■ spec.mainApplicationFile: The path to the main application file.
■ spec.sparkVersion: The version of Spark used in the application.
■ spec.restartPolicy: The restart policy for the application. In this case, the application will be
restarted if it fails, up to 3 times with a 10-second interval between each
restart. If the application fails to submit, it will be retried 5 times with a 20-second interval
between each retry.
■ spec.driver: The driver configuration for the application. In this case, we are using the
my-release-spark service account. The driver environment variables are set to read and write data
from MinIO.
■ spec.executor: The executor configuration for the application. In this case, we are using 3
executors with 1 core and 1GB of memory each. The executor environment variables are set to read
and write data from MinIO.

You can deploy the application using the following command:



Unset
kubectl apply -f sparkjob-minio.yaml

After the application is deployed, you can check the status of the application using the following command:

Unset
kubectl get sparkapplications -n spark-operator

You will see the following output:

Unset
NAME          STATUS    ATTEMPTS   START                  FINISH       AGE
spark-minio   RUNNING   1          2023-02-27T18:47:33Z   <no value>   4m4s
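
If the application stays in SUBMITTED or FAILED instead of moving to RUNNING, describing the resource and checking recent events is usually the quickest way to find the cause; a quick sketch:

kubectl -n spark-operator describe sparkapplication spark-minio
kubectl -n spark-operator get events --sort-by=.lastTimestamp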

Once the application is completed, you can check the output data in MinIO. You can use the following
command to list the files in the output directory:

Unset
mc ls minio/openlake/spark/output/taxi-data-output

You can also check the logs of the application using the following command:

Unset
kubectl logs -f spark-minio-driver -n spark-operator

You will see the following output:

Unset
23/02/27 19:06:11 INFO FileFormatWriter: Finished processing stats for write job 91dee4ed-3f0f-4b5c-8260-bf99c0b662ba.

2023-02-27 19:06:11,578 - MinioSparkJob - INFO - Total Rows for NYC Taxi Data: 112234626

2023-02-27 19:06:11,578 - MinioSparkJob - INFO - Total Rows for Passenger Count > 6: 1066

2023-02-27 19:06:11,578 - py4j.clientserver - INFO - Closing down clientserver connection

23/02/27 19:06:11 INFO SparkUI: Stopped Spark web UI at http://spark-minio-b8d5c4869440db05-driver-svc.spark-operator.svc:4040

23/02/27 19:06:11 INFO KubernetesClusterSchedulerBackend: Shutting down all executors

There is also an option for you to use the Spark UI to monitor the application while it runs. You can use the
following command to port forward the Spark UI for external access:

Unset
kubectl port-forward svc/spark-minio-ui-svc 4040:4040 -n spark-operator

In your browser, you can access the Spark UI using the following URL:

Unset
http://localhost:4040

You will see the following Spark UI:



Once the application is completed, you can delete the application using the following command:

Unset
kubectl delete sparkapplications spark-minio -n spark-operator

Deploying a Scheduled Spark Application is almost the same as deploying a normal Spark Application. The
only differences are that the kind is ScheduledSparkApplication, the application spec moves under
spec.template, and you add a spec.schedule field. You can save the following application as
sparkjob-minio-scheduled.yaml:

Unset
apiVersion: "sparkoperator.k8s.io/v1beta2"

kind: ScheduledSparkApplication

metadata:

name: spark-scheduled-minio

namespace: spark-operator

spec:

schedule: "@every 1h" # Run the application every hour

concurrencyPolicy: Allow

template:

type: Python

pythonVersion: "3"

mode: cluster

image: "openlake/sparkjob-demo:3.3.1"

imagePullPolicy: Always

mainApplicationFile: local:///app/main.py

sparkVersion: "3.3.1"

restartPolicy:

type: OnFailure

A Collection of Posts on MinIO and Kubernetes | 106


onFailureRetries: 3

onFailureRetryInterval: 10

onSubmissionFailureRetries: 5

onSubmissionFailureRetryInterval: 20

driver:

cores: 1

memory: "1024m"

labels:

version: 3.3.1

serviceAccount: my-release-spark

env:

- name: AWS_REGION

value: us-east-1

- name: AWS_ACCESS_KEY_ID

value: <Your-MinIO-AccessKey>

- name: AWS_SECRET_ACCESS_KEY

value: <Your-MinIO-SecretKey>

executor:

cores: 1

instances: 3

memory: "1024m"

labels:

version: 3.3.1

env:

- name: INPUT_PATH

value: "s3a://openlake/spark/sample-data/taxi-data.csv"

A Collection of Posts on MinIO and Kubernetes | 107


- name: OUTPUT_PATH

value: "s3a://openlake/spark/output/taxi-data-output"

- name: AWS_REGION

valueFrom:

secretKeyRef:

name: minio-secret

key: AWS_REGION

- name: AWS_ACCESS_KEY_ID

valueFrom:

secretKeyRef:

name: minio-secret

key: AWS_ACCESS_KEY_ID

- name: AWS_SECRET_ACCESS_KEY

valueFrom:

secretKeyRef:

name: minio-secret

key: AWS_SECRET_ACCESS_KEY

- name: ENDPOINT

valueFrom:

secretKeyRef:

name: minio-secret

key: ENDPOINT

You can deploy and see the results of the application in the same way as the normal Spark Application. The
above Spark Application will run every hour and will write the output to the same bucket.
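
Deploying and inspecting the scheduled version looks like the following sketch; each hourly run shows up as its own SparkApplication created by the operator:

kubectl apply -f sparkjob-minio-scheduled.yaml
kubectl -n spark-operator get scheduledsparkapplications
kubectl -n spark-operator get sparkapplications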

All the source code for this tutorial is available in the following GitHub repository: openlake/spark



Spark-itect for the Future

Apache Spark and MinIO are powerful tools for data lakes and analytics. Running Spark on Kubernetes gives
you the benefits of better resource management, fault tolerance and scalability for Spark jobs. Add high
performance and highly scalable MinIO and you have a combination that supports all your Spark workloads
wherever you need to run them – public/private cloud, data center, edge – on the Kubernetes platform of your
choice.

Download MinIO and give the Spark Operator a test drive. If you’ve got questions, please ask us on our Slack
channel.

About MinIO
MinIO is pioneering high performance, Kubernetes-native object storage for the multi-cloud. The
software-defined, Amazon S3-compatible object storage system is used by more than half of the Fortune
500. With 1.18B+ Docker pulls, MinIO is the fastest-growing cloud object storage company and is consistently
ranked by industry analysts as a leader in object storage. Founded in 2014, the company is backed by Intel
Capital, Softbank Vision Fund 2, Dell Technologies Capital, Nexus Venture Partners, General Catalyst and key
angel investors.

Additional Information:

MinIO Inc.                           Resources:
email: hello@min.io                  https://min.io
275 Shoreline Dr, Ste 100,           https://docs.min.io/
Redwood City, CA 94065,              https://blog.min.io/
United States
