Large Scale Testing With MAAS WP 05.08.19
Introduction
Designed with building agile data centres in mind, MAAS (Metal as a Service)
turns the management of bare-metal server resources into a cloud-like
provisioning experience. This maximises the use of hardware, reduces the
cost of operations and increases developer productivity.
Reducing the cost of operations, however, is not only measured by how quickly
workloads can be deployed. Operational efficiency is also measured by how
quickly and reliably the hardware can be on-boarded, made available, and verified
throughout its life cycle.
Based on that premise, MAAS was also designed to reduce the friction of the
hardware validation process. It provides a testing framework that allows large-
scale, automated testing of hardware with minimum effort to ensure that the
physical infrastructure is ready to serve production workloads.
An overview of MAAS hardware testing
MAAS provides a flexible and straightforward framework for administrators
to test their hardware. It also provides component level feedback that enables
administrators to quickly identify issues and take corrective actions before
putting hardware into production.
How does it work?
The ephemeral Ubuntu environment used for hardware testing is delivered
via PXE boot of the machine. It is the same environment used for tasks such as
commissioning, deployment or rescue mode. Once booted, the machine is
instructed to run each selected test via cloud-init and the MAAS script runner.
Installing dependencies
To run, a test script may require software dependencies that are not
available in the ephemeral environment. For this reason, each test can declare
the software it needs installed beforehand, and several different sources
are supported:
Tracking progress
To track progress, every two minutes the script runner inside the ephemeral
environment signals the status of each script that is currently running to MAAS.
The ephemeral environment also periodically pings MAAS to ensure machines don’t
run indefinitely. If MAAS does not receive a ping for more than 10 minutes, it
assumes that hardware testing has caused the system to lock up. Consequently,
MAAS marks the test as failed and updates the status of the machine.
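The timeout rule above can be illustrated with a minimal sketch. This is not MAAS's implementation: the function names, the machine names and the in-memory dictionary of last-ping times are all hypothetical, and only the 10-minute threshold comes from the text.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Threshold taken from the text: no ping for over 10 minutes means a lock-up is assumed.
HEARTBEAT_TIMEOUT = timedelta(minutes=10)


def has_timed_out(last_ping: datetime, now: datetime,
                  timeout: timedelta = HEARTBEAT_TIMEOUT) -> bool:
    """Return True when more than `timeout` has elapsed since the last ping."""
    return now - last_ping > timeout


def machines_timed_out(last_pings: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the (hypothetical) machine names whose heartbeat has lapsed, sorted."""
    return sorted(name for name, ping in last_pings.items()
                  if has_timed_out(ping, now))
```

A caller would run `machines_timed_out` periodically and mark the running test on each returned machine as failed.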
Tracking results
After running the tests, MAAS can collect metrics, logs and the results
of each test. The results will ultimately determine the status of the machine:
• Pending – The test is slated to run but has not begun yet
• Installing – Dependencies are being installed for the test before it begins running
• Running – The test is currently running
• Passed – The test has completed successfully
• Failed – The test failed
• Failed installing – Dependencies required for the test to run were unable
to be installed
• Timed out – The test ran past its specified run time or MAAS did not receive
a heartbeat from the machine being tested for over 10 minutes
• Aborted – The test was aborted before completing
• Degraded – The test has completed and the machine is usable but the test
detected degraded performance
• Skipped – The test is unable to run on current hardware
Additionally, all script output is captured and logged, and later sent to
MAAS for storage. A test can also report metrics on completion, which MAAS
can display in the UI and expose through the API.
Storage
The storage tests are designed to verify that the underlying storage has no
errors. Storage tests can also be divided into two main categories:
destructive and non-destructive. Destructive tests overwrite the contents
of a storage device and therefore cannot be run on an already-deployed
machine. Storage tests are described in the following table:
Memory
The memory tests are targeted to find issues with RAM. MAAS provides the
following tests:
CPU
The CPU tests are focused on testing the reliability of the CPU, but also provide
the ability to run CPU benchmarking:
Connectivity
The following tests are designed to validate proper network configuration:
• ntp (tags: network, ntp, node) – Run ntp clock set to verify NTP connectivity.
Testing at scale
With a fundamental understanding of how hardware testing works in MAAS,
the next step is to address testing servers at scale.
To start, all 14 machines are selected and the ‘Commissioning’ action is chosen.
This action allows selecting both commissioning options and hardware tests.
By default, the test cycle runs smartctl-validate. To extend it, select
additional tests such as smartctl-short for storage, 7z and stress-ng-cpu-short
for CPU, and stress-ng-memory-short for memory.
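The same selection can be scripted rather than clicked through. The sketch below only builds the command line for each machine. It assumes that the MAAS CLI `machine commission` operation accepts a `testing_scripts` parameter, as described in the MAAS 2.5 documentation; verify this against the installed version. The profile name and system IDs are hypothetical.

```python
from typing import List, Sequence

# Test names taken from the text above.
SELECTED_TESTS = [
    "smartctl-validate",
    "smartctl-short",
    "7z",
    "stress-ng-cpu-short",
    "stress-ng-memory-short",
]


def build_commission_command(profile: str, system_id: str,
                             tests: Sequence[str]) -> List[str]:
    """Build (but do not run) a MAAS CLI call that commissions one machine.

    Assumption: `testing_scripts` takes a comma-separated list of test names.
    """
    return ["maas", profile, "machine", "commission", system_id,
            "testing_scripts=" + ",".join(tests)]


def build_commands(profile: str, system_ids: Sequence[str],
                   tests: Sequence[str]) -> List[List[str]]:
    """Build one command per machine, preserving the order of `system_ids`."""
    return [build_commission_command(profile, sid, tests) for sid in system_ids]
```

Each returned list can be passed to `subprocess.run` once the profile has been logged in. Building the commands separately from running them makes it easy to review what will be executed across many machines.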
At this point, MAAS will start the commissioning process, and as soon as it
completes, it will automatically start testing the machines. MAAS will update
the UI with feedback about the status of the machines and whether tests are
running or pending for each of the major components, as per the image below:
MAAS will also provide feedback in the machine details page for each of the
major components (CPU, memory and storage), both in the ‘Machine summary’ tab
and in the ‘Hardware tests’ tab (which is explored in a later section).
Upon successful completion, machines become usable again. Whether that means
the ‘Ready’ state or a return to the ‘Deployed’ state depends on the original
state of the machine.
Note that machines in the ‘Ready’ state can also be tested with the ‘Test
hardware’ action.
Testing deployed machines
Machines in the ‘Deployed’ state can also be tested by executing the ‘Test
hardware’ action. The main difference in testing deployed machines is that MAAS
will prevent the execution of destructive tests.
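The restriction on deployed machines can be modelled with a short sketch. It is an illustration of the rule, not MAAS code: the `HardwareTest` type, the `destructive` flag and the test names used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class HardwareTest:
    """Hypothetical description of a test and whether it overwrites storage."""
    name: str
    destructive: bool


def runnable_tests(tests: Sequence[HardwareTest],
                   machine_state: str) -> List[HardwareTest]:
    """Return the tests allowed to run for a machine in `machine_state`.

    Destructive tests are filtered out when the machine is 'Deployed'.
    """
    if machine_state == "Deployed":
        return [test for test in tests if not test.destructive]
    return list(tests)
```

For a machine in any other state, the selection is returned unchanged.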
Testing failures
When a machine fails testing, it is transitioned to a ‘Failed testing’ state and
MAAS will highlight which component has failed (CPU, RAM or storage), as per
the figure below:
NOTE: For the purpose of this example, a custom test (stress-ng-cpu-custom) was created to simulate
a test failure.
To investigate the test failure, the ‘Hardware tests’ tab of the machine details
page shows which tests have failed (and which succeeded). In this case, it is
stress-ng-cpu-custom, detailed in the figure below:
To investigate further, the ‘View log’ link provides all the information
collected about the test, which can help the administrator determine the reason
for the failure.
NOTE: If test failures are false positives, administrators can re-execute the process. If the test
succeeds, the hardware will be set to ‘Ready’ and show no errors, and MAAS will keep track of all
previous tests that were run, whether successful or not.
Overriding failed tests
Even if a machine has ‘Failed testing,’ administrators may want to continue to
use it because the test result does not affect their use case, or because the
failure was a false positive. However, to make the machine usable again and
return it to its previous state, the administrator has to manually confirm that
they want to do so. This is done with the ‘Override failed testing’ action.
Once the failed test has been overridden, MAAS will show a warning and continue
to display the component that failed, so that users know this machine failed
testing. The resulting state depends on where the machine started: a deployed
machine is transitioned back to ‘Deployed,’ whereas a machine that was in the
‘New’ or ‘Ready’ state (or any other) is transitioned to ‘Ready,’ as per the
figure below.
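The transition rule described above is simple enough to express as a small function. This is a sketch of the rule as stated in the text, with hypothetical names; it does not call MAAS.

```python
from typing import NamedTuple


class OverrideResult(NamedTuple):
    """Hypothetical outcome of overriding a failed test."""
    state: str             # state the machine is transitioned to
    failed_component: str  # still displayed as a warning after the override


def override_failed_testing(previous_state: str,
                            failed_component: str) -> OverrideResult:
    """Apply the rule from the text: 'Deployed' stays 'Deployed', else 'Ready'."""
    new_state = "Deployed" if previous_state == "Deployed" else "Ready"
    return OverrideResult(state=new_state, failed_component=failed_component)
```

The failed component is carried through unchanged because the text notes that MAAS keeps displaying it after the override.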
Conclusion
Hardware testing is a critical step in the hardware life cycle of any
infrastructure, and organizations have to make it a priority for the continued
success of their operations. By testing early, organizations ensure that their
infrastructure can meet application demands, reducing the risk of failures in
production and at peak times that could put their whole infrastructure at risk.
https://docs.maas.io/2.5/en/nodes-scripts
https://maas.io
© Canonical Limited 2019. Ubuntu, Kubuntu, Canonical and their associated logos are the registered trademarks
of Canonical Ltd. All other trademarks are the properties of their respective owners. Any information referred
to in this document may change without notice and Canonical will not be held responsible for any such changes.
Canonical Limited, Registered in England and Wales, Company number 110334C Registered Office:
12-14 Finch Road, Douglas, Isle of Man, IM99 1TT VAT Registration: GB 003 2322 47