
Incumbent Overview

Background
Incumbent is the Ansible project used by the AVS Platform team to perform automated and repeatable
tasks on the Azure VMware Solution (AVS) internal platform. Incumbent consists of Ansible roles,
custom Ansible modules, and playbooks that are distributed as a single Debian package.

There are three targeted uses for Incumbent:
1. Command line (from CLU) for initial standup of the Pod Under Cloud (PUC)
2. Via the AWX RESTful API in the PUC space
3. Via the AWX RESTful API in the Tenant Under Cloud (TUC) space

Incumbent via AWX is designed to be a task runner. There is no state stored in Incumbent itself. All
state comes directly from Conveyor (see External Dependencies below for more info on Conveyor) and
relevant state data is passed in via the API call that Conveyor makes.
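
As a rough illustration of this pattern, the snippet below shows the kind of state data Conveyor
might hand to an Incumbent run; the variable names and values are hypothetical and only meant to
show that every run is fully parameterized by the caller.

    # Illustrative only: the kind of state a Conveyor call might carry.
    # Incumbent stores none of this; each run receives what it needs from the caller.
    example_extra_vars = {
        "tenant_id": "tenant-0001",            # hypothetical tenant identifier
        "keyvault_name": "example-tenant-kv",  # per-tenant Keyvault created by Conveyor
        "operation": "sddc_host_add",          # hypothetical operation name
    }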

This document covers the three use cases above, along with identifying external dependencies and
how they interact. It is not a deep dive into the Incumbent code itself, but rather an
overview of how it operates and where it fits in the overall AVS pipeline.

External Dependencies
Incumbent relies on several external dependencies in order to function. Below is a list of these
dependencies and how they impact Incumbent.

•  Conveyor
   o  Control plane execution/workflow mechanism.
   o  Performs calls into Incumbent/AWX to stand up, manage, and tear down SDDC components.
   o  State data is stored in Conveyor; Incumbent is designed as the task runner.
   o  Per-tenant Azure Keyvaults are created by Conveyor and passed into Incumbent workflows
      in order to keep tenant secrets unique and not shared.
   o  Monitoring and alerting of Incumbent job failures is driven through Conveyor today.
•  Woke
   o  Stands up the PUC-related components, such as Azure DNS, Azure Keyvault, and many
      other resources that Incumbent relies on to function properly.
•  Azure Keyvault
   o  All passwords that Incumbent relies on are stored in Azure Keyvaults. Keyvaults are
      generally split up based on environment (prod/staging/testing/lab), region
      (eastus/westeurope/etc.), site (rtsn01/eastus/etc.), and pod (p01/p02/etc.).
   o  Per-tenant Keyvaults are leveraged in TUC space for security purposes.
•  Azure Storage Blob
   o  ISO images are stored in Azure Storage Blobs and can be downloaded when needed via
      Incumbent.
   o  All the vmcp-releases artifacts, notably the dump.json for a site, are stored in Azure
      Storage Blob, and Incumbent accesses this data via web URL.
•  CLU
   o  For COLO sites, CLU is the first server in Rack 01.
   o  For DC sites, CLU is an Azure virtual machine stood up by Woke.
   o  Used in PUC standup. More details in the next section.
•  VMCP Releases
   o  Repo that provides the version manifests via Azure Storage Blob. The dump.json file is
      specific to a pod and defines all the software versions for that site along with many
      other configuration parameters. (A minimal fetch sketch follows this list.)
•  Stackstorm
   o  Monitors the VMCP Releases dump.json in order to perform automated updates of many of
      the software components in a pod.
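
To make the dump.json dependency concrete, below is a minimal sketch of how a version manifest
could be pulled from Azure Storage Blob over a plain web URL. The storage account, container,
path, and manifest keys are placeholders and do not reflect the real layout.

    # Minimal sketch: fetching a pod's dump.json from Azure Storage Blob over HTTPS.
    # The account, container, and path below are placeholders only.
    import requests

    DUMP_URL = (
        "https://vmcpreleases.blob.core.windows.net/"   # assumed storage account
        "releases/rtsn01/p01/dump.json"                 # assumed container/path for a pod
    )

    manifest = requests.get(DUMP_URL, timeout=30).json()
    # The exact schema of dump.json is not shown in this document; "versions" is an assumed key.
    for component, version in manifest.get("versions", {}).items():
        print(component, version)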

Use Cases
Command line (from CLU) for initial standup of the Pod Under Cloud (PUC)

The first Incumbent use case is when a pod is being stood up. From CLU, Incumbent is kicked off by
hand using ansible-playbook to perform all of the steps for standing up the PUC, including:

•  Deploy and configure the PUC VMware cluster
   o  Create an initial VSAN datastore and deploy VCSA
   o  Join the designated VMware ESXi hosts to the PUC cluster
   o  Assign all disks and expand and properly configure the full VSAN datastore
•  Deploy and configure the NSX-V Software-Defined Networking solution
   o  Deploy NSX-V Manager
   o  Deploy NSX-V Controllers
   o  Configure NSX-V VXLAN
•  Deploy and configure the virtual routers so that the PUC management components can talk to
   the internet
•  Deploy the PUC management virtual machines
   o  Deploy and configure the PUC Linux Jumpbox
   o  Deploy and configure Docker on all other virtual machines
   o  Deploy and configure the Portainer manager to manage the Docker environment
   o  Deploy the Portainer stacks (Docker Compose stacks) for the PUC services
   o  Deploy and configure other miscellaneous software components (AzSecPack,
      Wavefront, OMS Agents, Vulnerability Scanner, etc.)
•  Configure the PUC AWX instance (deployed in the previous step)
•  Deploy and configure Stackstorm/st2-vmcp-releases

Via the AWX RESTful API in the PUC space
The second use case applies once the PUC components have been stood up and
the ESXi hosts for the pod have been identified, validated, and provisioned. In this use case, Incumbent
is combined with AWX (the open-source version of Ansible Tower) to provide a RESTful API
endpoint that allows Conveyor to make calls into AWX to run Incumbent job templates and workflow job
templates.

The calls from Conveyor to the PUC AWX are targeted to a specific tenant through the use of AWX
inventories. The workflows are executed in the same order for every tenant, but with data specific
to the target tenant, including items such as the tenant ID, the customer-provided network address
space used in the standup of the customer SDDC, tenant-specific Keyvaults, etc.
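
As a rough sketch of what such a tenant-targeted call could look like, the snippet below resolves a
tenant's AWX inventory by name and launches a job template against it. The URL, token, template ID,
inventory name, and extra_vars are placeholders; the real calls are made by Conveyor.

    # Hypothetical sketch of a tenant-targeted call into the PUC AWX.
    # The URL, token, template ID, inventory name, and extra_vars are placeholders.
    import requests

    AWX_URL = "https://puc-awx.example.internal"        # assumed PUC AWX endpoint
    HEADERS = {"Authorization": "Bearer <awx-oauth-token>"}

    # Look up the tenant's inventory by name (one inventory per tenant).
    inventory = requests.get(
        f"{AWX_URL}/api/v2/inventories/",
        headers=HEADERS,
        params={"name": "tenant-0001"},
        timeout=30,
    ).json()["results"][0]

    # Launch the job template against that inventory with tenant-specific data.
    # (Assumes the template allows the inventory to be provided at launch time.)
    requests.post(
        f"{AWX_URL}/api/v2/job_templates/17/launch/",   # 17 is a placeholder template ID
        headers=HEADERS,
        json={
            "inventory": inventory["id"],
            "extra_vars": {
                "tenant_id": "tenant-0001",
                "customer_network_cidr": "10.10.0.0/22",  # customer-provided address space
            },
        },
        timeout=30,
    ).raise_for_status()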

A standard workflow for standing up a new tenant will start with the following actions against the PUC
AWX instance from Conveyor (blue denotes new as of sark 2.0):

1. sark-express-route – AWX job template that completes the pod side of the express route standup
   for the tenant management and tenant customer express routes.
2. tuc-bootstrap – AWX job template that deploys the TUC virtual machine components (jump box,
   services host, NSX-V edge devices, virtual routers for IPsec tunnels).
3. network-node-move – AWX job template that moves an ESXi host from PUC management space
   into tenant networking space. Performed for each host Conveyor has assigned to the new
   tenant.
4. ipsec-internet – AWX job template that configures the virtual routers (vrtrs, deployed in the TUC
   space) and the internet gateways (igws, deployed in Azure) to stand up the IPsec tunnels that give
   the TUC space internet access.
5. bootstrap – AWX job template that configures the Linux Jump Box (ljb), installs Docker on the
   services host (svcs), and deploys AWX, NTP, DNS, and other management Docker containers onto
   the services host.
6. update-source – AWX job template that installs the proper version of Incumbent onto the
services host in order to make it available via the TUC AWX instance.
7. awx-config – AWX job template that configures the TUC AWX instance (job templates, workflow
job templates, inventories, secrets, projects, etc.)

From this point, Conveyor would transition to calling directly to the TUC AWX instance, which is
described in the third use case section below.

Other operations that Conveyor will call into the PUC AWX for are listed below.

SDDC Scaling (adding/removing hosts):

•  network-node-move – Move a host from PUC management space to TUC network space (add
   host) or from TUC network space to PUC management space (remove host).

SDDC Delete (tearing down an entire customer SDDC):

•  network-node-move – Move a host from TUC network space to PUC management space.
•  sark-express-route – Tear down the tenant management and tenant customer express route
   configuration.
•  tuc-teardown – Remove all of the TUC VMs and associated components.

Via the AWX RESTful API in the Tenant Under Cloud (TUC) space

The third and final use case is Conveyor calling directly into the AWX RESTful API in the Tenant Under
Cloud (TUC) space. The TUC AWX instance has a specific set of workflow job templates that Conveyor
will call directly. These workflows are:

•  awx_update_workflow – AWX workflow job template that runs the steps to update Incumbent
   and run awx-config. This workflow is not currently called.
•  sddc_external_auth_config_workflow – Called when the customer sets up an external
   authentication source in the portal. Conveyor passes the information into this workflow in
   order to configure the VCSA to use the new authentication information.
•  sddc_host_add_workflow – Called when a customer scales their SDDC up, either adding hosts or
   adding a cluster.
•  sddc_host_remove_workflow – Called when a customer scales their SDDC down, either
   removing hosts or removing a cluster.
•  vmcp_new_tenant_workflow – Called when a customer creates a new SDDC. This workflow
   performs all the steps needed to stand up a new VMware cluster and install NSX-T, HCX, and
   other associated software.
•  vmcp_fetch_tenant_certs_workflow – Called to retrieve the self-signed certificate data so that
   customers can add it into their internal key stores if needed.
•  vmcp_tenant_restore_nsx_workflow – Called after the NSX-T manager has been replaced in a
   disaster recovery scenario. This will restore the latest NSX-T backup data.
•  vmcp_tenant_restore_vcsa_workflow – Called after the VCSA has been replaced in a disaster
   recovery scenario. This will restore the latest VCSA backup data.

The type of operation dictates which of the above workflows is called. For a clear illustration
of the process, please refer to the next section on process flows.
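
As a rough sketch of how an operation could map to one of these workflows, the snippet below pairs
hypothetical operation names with the workflow job templates listed above and launches the selected
one via the TUC AWX API; the URL, token, and operation names are placeholders.

    # Hypothetical sketch: selecting and launching a TUC AWX workflow for an operation.
    # The URL, token, and operation names are placeholders; the workflow names come from
    # the list above.
    import requests

    TUC_AWX_URL = "https://tuc-awx.example.internal"    # assumed TUC AWX endpoint
    HEADERS = {"Authorization": "Bearer <awx-oauth-token>"}

    WORKFLOW_FOR_OPERATION = {
        "create_sddc": "vmcp_new_tenant_workflow",
        "add_hosts": "sddc_host_add_workflow",
        "remove_hosts": "sddc_host_remove_workflow",
        "configure_external_auth": "sddc_external_auth_config_workflow",
        "fetch_certs": "vmcp_fetch_tenant_certs_workflow",
    }

    def launch_workflow(operation: str, extra_vars: dict) -> int:
        """Resolve the workflow job template by name and launch it; returns the workflow job ID."""
        name = WORKFLOW_FOR_OPERATION[operation]
        template = requests.get(
            f"{TUC_AWX_URL}/api/v2/workflow_job_templates/",
            headers=HEADERS,
            params={"name": name},
            timeout=30,
        ).json()["results"][0]
        resp = requests.post(
            f"{TUC_AWX_URL}/api/v2/workflow_job_templates/{template['id']}/launch/",
            headers=HEADERS,
            json={"extra_vars": extra_vars},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["id"]   # ID of the launched workflow job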

Process flows
Below are the four most prominent process flows in the AVS SDDC space. These should give a good
understanding of the roles of the PUC AWX and TUC AWX and help visualize the entire process for
discussions later in this document.

New Customer SDDC (diagram)

Scale Up Customer SDDC (diagram)

Scale Down Customer SDDC (diagram)

Delete Customer SDDC (diagram)

Monitoring/Alerting
The main monitoring and alerting path for failures in Incumbent workflows is Conveyor. Conveyor,
the workflow engine for the customer SDDCs, monitors all steps for success or failure and reports
failures back through AppInsights and/or Geneva (this work is in transition between the two). Each
of the calls depicted above from Conveyor to either the PUC AWX or the TUC AWX is actually a set of
calls that looks like the following:

1. Submit the job to AWX, passing in all the required data.
2. Start a poll loop, checking the status of the previously submitted job.
3. Once the job is finished, if it succeeded, move on to the next step. If it failed, send
   logs and alerts via the appropriate channels and restart the workflow at the failed step.
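
A minimal sketch of this submit-and-poll pattern is shown below; the AWX URL, token, template ID,
and polling interval are placeholders, and the real retry and alerting logic lives in Conveyor.

    # Minimal sketch of the submit-and-poll pattern described above (placeholder values).
    import time
    import requests

    AWX_URL = "https://puc-awx.example.internal"
    HEADERS = {"Authorization": "Bearer <awx-oauth-token>"}

    # 1. Submit the job to AWX, passing in all the required data.
    job = requests.post(
        f"{AWX_URL}/api/v2/job_templates/42/launch/",   # 42 is a placeholder template ID
        headers=HEADERS,
        json={"extra_vars": {"tenant_id": "tenant-0001"}},
        timeout=30,
    ).json()

    # 2. Poll the job until it reaches a terminal state.
    while True:
        status = requests.get(
            f"{AWX_URL}/api/v2/jobs/{job['id']}/",
            headers=HEADERS,
            timeout=30,
        ).json()["status"]
        if status in ("successful", "failed", "error", "canceled"):
            break
        time.sleep(30)

    # 3. On success, continue to the next step; on failure, alert and retry the failed step.
    if status != "successful":
        raise RuntimeError(f"AWX job {job['id']} finished with status {status}")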

In addition to this main path, Incumbent also sends data into Geneva and posts alerts into Teams
channels, along with links to Azure Storage Blobs that hold logs uploaded directly from Incumbent,
to aid in troubleshooting.
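
As a rough illustration, a Teams alert with a log link could be posted through an incoming webhook
as sketched below; the webhook and blob URLs are placeholders, and the actual alert format used by
Incumbent is not shown here.

    # Hypothetical sketch: posting a failure alert with a log link to a Teams channel
    # via an incoming webhook. The webhook URL and blob URL are placeholders.
    import requests

    TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/placeholder"
    LOG_URL = "https://examplelogs.blob.core.windows.net/incumbent-logs/job-42.tar.gz"

    requests.post(
        TEAMS_WEBHOOK_URL,
        json={"text": f"Incumbent job failed. Uploaded logs: {LOG_URL}"},
        timeout=30,
    ).raise_for_status()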

In addition, the PUC AWX service is monitored via Wavefront to ensure it is up and healthy; if it
is not, an alert is sent to the SRE team.

Performance Considerations
When discussing performance around Incumbent/AWX, we must first understand the profile of the work
and the limits on how much of it will be handled at any given time.

For the PUC AWX, the biggest limiting factor on how much work can be submitted is the maximum
number of tenants that any pod will support. In the case of our current colocation sites, that
maximum is 64; in the new DC-style sites it will be 84. That translates into an absolute maximum
of 84 concurrent operations for these sites in the current configuration.

The PUC AWX is a dedicated virtual machine with 4 vCPU and 16GB of RAM. This configuration still has
plenty of headroom and, if needed, can be expanded in the future.

For the TUC AWX, the limiting factor is that the portal code will not allow more than one operation
to occur against a tenant at a time, meaning only one TUC AWX workflow will ever be running at a
time.

Overall, this is a low volume pipeline built specifically for performing SDDC operations.

Testing Considerations
The current testing strategy for Incumbent relies on functional tests run periodically via Jenkins.
The tests can also be kicked off manually at any time if resources are available. These functional
tests take a long time to run (on the order of a few hours) and can add a lot of time to an
iterative development cycle. There are some strategies that can be leveraged, such as deploying an
SDDC in the lab and performing manual testing of components; however, this can lead to resource
starvation for the automated functional tests if not used sparingly.

Also, due to the nature of what Incumbent does, it has been difficult to mock all of the interfaces
Incumbent needs in order to reduce its reliance on physical assets for testing. This is still very
much under investigation, but at this time our testing efforts are limited by how many resources
are available in our lab environment.
