
Large-scale automated hardware testing with MAAS


August 2019

Introduction
Designed with building agile data centres in mind, MAAS (Metal as a Service)
turns the management of bare-metal server resources into a cloud-like
provisioning experience. This maximises the use of hardware, reduces the
cost of operations and increases developer productivity.

Reducing the cost of operations, however, is not only measured by how quickly
workloads can be deployed. Operational efficiency is also measured by how
quickly and reliably the hardware can be on-boarded, made available, and verified
throughout its life cycle.

A critical step in the hardware on-boarding process is to provide confidence that
the hardware is reliable enough to satisfy the workload requirements. As such,
organisations put a lot of effort into testing their hardware to ensure minimum
requirements are met before making it available to their workloads. This is a
critical, yet time-consuming process.

Based on that premise, MAAS was also designed to reduce the friction of the
hardware validation process. It provides a testing framework that allows large-
scale, automated testing of hardware with minimum effort to ensure that the
physical infrastructure is ready to serve production workloads.
An overview of MAAS hardware testing
MAAS provides a flexible and straightforward framework for administrators
to test their hardware. It also provides component level feedback that enables
administrators to quickly identify issues and take corrective actions before
putting hardware into production.

By using an internal workflow mechanism, MAAS makes running complex
hardware tests at scale as easy as deploying a single machine.

How MAAS works

What can MAAS do?


Before showing how hardware testing works in MAAS, it is important to
understand what the framework allows users to do:

• Administrators can test the various components of a machine. MAAS provides
a list of predefined tests mainly targeted at CPU, memory and storage.
• Administrators can also create and run their own tests, gather results and
collect metrics.
• MAAS enables differentiation between destructive and non-destructive tests,
which lets MAAS prevent data loss while testing on deployed machines.
• It also gives the ability to decide which tests to run manually every time,
or which to run automatically.
• MAAS tracks test progress and reports issues in case of failure.
• MAAS provides excellent feedback in its UI about the status of each component.
• MAAS ensures failing hardware is never made available to users, allowing
the administrator to take appropriate steps to solve underlying problems.

How does it work?
The ephemeral Ubuntu environment used for hardware testing is delivered
via PXE boot of the machine. It is the same environment used for tasks such as
commissioning, deployment or rescue mode. Once booted, the machine is
instructed to run each selected test via cloud-init and the MAAS script runner.

Installing dependencies
To run, a test script may require software dependencies that are not available
in the ephemeral environment. For this reason, each test can declare the
software that must be installed beforehand, using any of the following sources:

• apt – to indicate this is a Debian package from the configured archive
• snap – to indicate this is a snap to be installed
• url – provides a URL from which to obtain the software

Once the dependencies are installed, the tests will be executed.
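
As an illustration of the three sources above, the following Python sketch models a test's dependency declaration and turns it into an install plan. The `Dependencies` class, the `install_plan` function, the install order and the example URL are all hypothetical and are not part of MAAS; the actual declaration syntax is described in the MAAS script documentation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Dependencies:
    """Hypothetical model of the software a test declares it needs."""
    apt: List[str] = field(default_factory=list)   # Debian packages from the configured archive
    snap: List[str] = field(default_factory=list)  # snaps to be installed
    url: List[str] = field(default_factory=list)   # URLs from which to obtain software


def install_plan(deps: Dependencies) -> List[Tuple[str, str]]:
    """Return (source, item) pairs; the apt, snap, url order is an assumption."""
    plan = []
    for source in ("apt", "snap", "url"):
        for item in getattr(deps, source):
            plan.append((source, item))
    return plan


# Example: a test that needs one package and one downloaded tool (placeholder URL).
deps = Dependencies(apt=["stress-ng"], url=["https://example.com/vendor-tool.tar.gz"])
for source, item in install_plan(deps):
    print(f"{source}: {item}")
```

Only once every entry in such a plan has been installed would the test itself be executed.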

Tracking progress
To track progress, every two minutes the script runner inside the ephemeral
environment signals the status of each script that is currently running to MAAS.
The ephemeral environment also periodically pings MAAS to ensure machines don’t
run indefinitely. If MAAS does not receive a ping for 10 minutes, MAAS assumes
hardware testing has caused the system to lock up. Consequently, MAAS marks
the test as failed and updates the status of the machine.
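
The timeout rule described above can be sketched in a few lines of Python. This is a minimal illustration of the 10-minute rule only, not MAAS code; the function name `has_locked_up` and the constant name are hypothetical.

```python
from datetime import datetime, timedelta

# From the text: MAAS assumes a lock-up when no ping arrives for 10 minutes.
HEARTBEAT_TIMEOUT = timedelta(minutes=10)


def has_locked_up(last_ping: datetime, now: datetime) -> bool:
    """Return True when the last ping is older than the timeout."""
    return now - last_ping > HEARTBEAT_TIMEOUT


last_ping = datetime(2019, 8, 1, 12, 0)
print(has_locked_up(last_ping, last_ping + timedelta(minutes=4)))   # False
print(has_locked_up(last_ping, last_ping + timedelta(minutes=11)))  # True
```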

Tracking results
After running the tests, MAAS can collect metrics, logs and the results
of each test. The results will ultimately determine the status of the machine:

• Pending – The test is slated to run but has not begun yet
• Installing – Dependencies are being installed for the test before it begins running
• Running – The test is currently running
• Passed – The test has completed successfully
• Failed – The test failed
• Failed installing – Dependencies required for the test to run were unable
to be installed
• Timed out – The test ran past its specified run time or MAAS did not receive
a heartbeat from the machine being tested for over 10 minutes
• Aborted – The test was aborted before completing
• Degraded – The test has completed and the machine is usable but the test
detected degraded performance
• Skipped – The test is unable to run on current hardware
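
To make the status list above concrete, the sketch below encodes the statuses as a Python enumeration and reduces a set of per-test results to a single outcome. The grouping is an assumption made for illustration: 'Failed', 'Failed installing' and 'Timed out' count as failures, 'Degraded' and 'Skipped' do not, and 'Aborted' is reported separately. MAAS itself may combine results differently.

```python
from enum import Enum
from typing import Iterable


class ScriptStatus(Enum):
    PENDING = "Pending"
    INSTALLING = "Installing"
    RUNNING = "Running"
    PASSED = "Passed"
    FAILED = "Failed"
    FAILED_INSTALLING = "Failed installing"
    TIMED_OUT = "Timed out"
    ABORTED = "Aborted"
    DEGRADED = "Degraded"
    SKIPPED = "Skipped"


# Hypothetical groupings, see the assumptions stated above.
IN_PROGRESS = {ScriptStatus.PENDING, ScriptStatus.INSTALLING, ScriptStatus.RUNNING}
FAILURES = {ScriptStatus.FAILED, ScriptStatus.FAILED_INSTALLING, ScriptStatus.TIMED_OUT}


def summarise(statuses: Iterable[ScriptStatus]) -> str:
    """Collapse per-test statuses into one outcome string."""
    seen = set(statuses)
    if seen & IN_PROGRESS:
        return "in progress"
    if seen & FAILURES:
        return "failed"
    if ScriptStatus.ABORTED in seen:
        return "aborted"
    return "passed"


print(summarise([ScriptStatus.PASSED, ScriptStatus.DEGRADED]))
print(summarise([ScriptStatus.PASSED, ScriptStatus.TIMED_OUT]))
```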

Additionally, all script output is captured and logged, and is later sent to
MAAS for storage. A test can also report metrics on completion, which MAAS
can display through the UI and API.

For more information, please refer to https://docs.maas.io/2.5/en/nodes-scripts


What tests does MAAS provide?
MAAS comes with a standard set of tests that are primarily focused on
testing storage, memory and CPU. It also provides other tests to verify
network connectivity.

Storage
The storage tests are designed to verify that the underlying storage has no
errors. Storage tests can also be divided into two main categories –
destructive and non-destructive. Destructive tests will overwrite the contents
of a storage device. Destructive tests may not be run on an already-deployed
machine. Storage tests are described in the following table:

Name                    Category Tags             Description
smartctl-validate       Storage, commissioning    Run the SMART self-test and validate SMART health on all drives in parallel.
smartctl-short          Storage                   Run the short SMART self-test and validate SMART health on all drives in parallel.
smartctl-long           Storage                   Run the long SMART self-test and validate SMART health on all drives in parallel.
smartctl-conveyance     Storage                   Run the conveyance SMART self-test and validate.
badblocks               Storage                   Run badblocks on a disk in read-only mode.
badblocks-destructive   Storage, destructive      Run badblocks on a disk in read/write destructive mode.
fio                     Storage, destructive      Run Fio benchmarking against selected storage devices.

Memory
The memory tests are targeted to find issues with RAM. MAAS provides the
following tests:

Name                     Category Tags   Description
memtester                Memory          Run memtester against all available userspace memory.
stress-ng-memory-long    Memory          Run stress-ng memory tests for 12 hours.
stress-ng-memory-short   Memory          Run stress-ng memory tests for 5 minutes.

CPU
The CPU tests are focused on testing the reliability of the CPU, but also provide
the ability to run CPU benchmarking:

Name                  Category Tags   Description
stress-ng-cpu-long    cpu             Run stress-ng CPU tests for 12 hours.
stress-ng-cpu-short   cpu             Run stress-ng CPU tests for 5 minutes.
7z                    cpu             Run 7zip CPU benchmarking.

Connectivity
The following tests are designed to validate proper network configuration:

Name                    Category Tags             Description
internet-connectivity   network, internet, node   Check if the system has access to the internet.
ntp                     network, ntp, node        Run ntp clock set to verify NTP connectivity.

Testing at scale
With a fundamental understanding of how hardware testing works in MAAS,
the next step is to address testing servers at scale.

Before diving into an example, a few things are worth noting:

• Hardware testing can be run on any machine, even on ‘Deployed’ systems.
MAAS will prevent ‘destructive’ tests on ‘Deployed’ systems.
• Machines that have failed testing can still be used, but MAAS will continue
to highlight that there has been a failure.
• After successful testing, machines in ‘New’ or ‘Ready’ state will be made
available. After an unsuccessful test result, they will be marked as
‘Failed testing’.
• Similarly, ‘Deployed’ machines will return to their original state after
successful testing, or will be marked as ‘Failed testing’ in case of test failure.
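
The state rules in the list above can be summarised as a small Python function. This is a sketch of the behaviour as described in this paper, not MAAS code; the function name is hypothetical, and states the text does not mention are rejected rather than guessed.

```python
def state_after_testing(original_state: str, tests_passed: bool) -> str:
    """Return the machine state once hardware testing has finished."""
    if original_state not in ("New", "Ready", "Deployed"):
        # The text only covers these three starting states.
        raise ValueError(f"state not covered by this sketch: {original_state}")
    if not tests_passed:
        return "Failed testing"
    if original_state == "Deployed":
        return "Deployed"  # deployed machines return to their original state
    return "Ready"         # 'New' and 'Ready' machines are made available


for state in ("New", "Ready", "Deployed"):
    print(state, "->", state_after_testing(state, True), "/", state_after_testing(state, False))
```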

Commissioning and testing
In this example, 14 machines have previously been added. These machines are all
in the ‘New’ state and have never had tests run on them. Since this is the first
time, these machines will be commissioned with hardware tests.

To start, all 14 machines are selected and the ‘Commissioning’ action is chosen.
This action gives the ability to select both commissioning options and hardware
tests. To extend the default test cycle, which includes smartctl-validate, select
additional tests such as smartctl-short for storage, 7z and stress-ng-cpu-short
for CPU, and stress-ng-memory-short for memory.
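
The same selection can in principle be scripted rather than made through the UI. The Python sketch below only builds the command lines and does not run them. It assumes the MAAS CLI accepts a `testing_scripts` parameter on `machine commission`, as the MAAS script documentation describes it to the best of our understanding; verify this against the installed MAAS version. The profile name `admin` and the system IDs are placeholders.

```python
from typing import List

# Tests chosen in the example above.
SELECTED_TESTS = [
    "smartctl-validate",
    "smartctl-short",
    "7z",
    "stress-ng-cpu-short",
    "stress-ng-memory-short",
]


def commission_command(profile: str, system_id: str, tests: List[str]) -> List[str]:
    """Build (but do not execute) an assumed MAAS CLI commissioning command."""
    return [
        "maas", profile, "machine", "commission", system_id,
        "testing_scripts=" + ",".join(tests),
    ]


# Placeholder system IDs; a real deployment would list its own machines.
for system_id in ["abc123", "def456"]:
    print(" ".join(commission_command("admin", system_id, SELECTED_TESTS)))
```

Building the argument list separately from executing it keeps the sketch safe to run anywhere and makes the resulting commands easy to review before they are applied to a fleet of machines.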

At this point, MAAS will start the commissioning process, and as soon as it
completes, it will automatically start testing the machines. MAAS will update
the UI with feedback about the status of the machines and whether tests are
running or pending for each of the major components, as per the image below:

Machine listing page

MAAS will also provide feedback in the machine details page for each of the
major components (CPU, memory and storage), both in the ‘Machine summary’ tab
and in the ‘Hardware tests’ tab (which is explored in a later section).

Machine details page – machine summary

Upon successful completion, machines will become usable again (whether that
means in ‘Ready’ or back to ‘Deployed’ state depends on the original state of
the machine).

Please note that machines in the ‘Ready’ state can also be tested with
the ‘Test hardware’ action.

Testing deployed machines
Machines in the ‘Deployed’ state can also be tested by executing the ‘Test
hardware’ action. The main difference in testing deployed machines is that MAAS
will prevent the execution of destructive tests.

It is important to remember that destructive tests will cause irreparable damage
to the file system. The ‘destructive’ tag typically applies to tests that write
to the disk, or that test the disks themselves in a read/write manner. Of the
tests listed in the storage table earlier, badblocks-destructive and fio are
tagged as destructive. MAAS will prevent these tests from ever being run on a
deployed system.
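
The protection described above can be sketched as a simple filter in Python. The tag data is taken from the storage table earlier in this paper; the function name and the choice to drop destructive tests silently (rather than reject the whole request) are assumptions made for illustration, not a description of MAAS internals.

```python
from typing import Dict, Iterable, List, Set

# Tags as listed in the storage tests table.
TEST_TAGS: Dict[str, Set[str]] = {
    "smartctl-validate": {"storage", "commissioning"},
    "smartctl-short": {"storage"},
    "badblocks": {"storage"},
    "badblocks-destructive": {"storage", "destructive"},
    "fio": {"storage", "destructive"},
}


def runnable_tests(selected: Iterable[str], machine_state: str) -> List[str]:
    """Drop tests tagged 'destructive' when the machine is deployed."""
    if machine_state != "Deployed":
        return list(selected)
    return [
        name for name in selected
        if "destructive" not in TEST_TAGS.get(name, set())
    ]


selected = ["smartctl-short", "badblocks", "badblocks-destructive"]
print(runnable_tests(selected, "Ready"))
print(runnable_tests(selected, "Deployed"))
```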

Testing failures
When a machine fails testing, it is transitioned to a ‘Failed testing’ state and
MAAS will highlight which component has failed (CPU, RAM or storage), as per
the figure below:

Machine listing – Failed testing

NOTE: For the purpose of this example, a custom test (stress-ng-cpu-custom) was created to simulate
a test failure.

To investigate the test failure, the ‘Hardware tests’ tab of the machine details
page shows which tests have failed (and which succeeded). In this case, it is
stress-ng-cpu-custom, as detailed in the figure below:

Machine details – Hardware tests tab

To further investigate the failure, the ‘View log’ link provides all the information
collected about the test, which can help the administrator determine the reason
for the failure.

NOTE: If test failures are false positives, administrators can re-execute the process. If the test
succeeds, the hardware will be set to ‘Ready’ and show no errors, and MAAS will keep track of all
previous tests that were run, whether successful or not.

Overriding failed tests
Even if a machine has ‘Failed testing’, administrators may want to continue to use
it because the test result does not affect their use case, or because the failure
was a false positive. However, in order to make this machine usable again and get
it back to its previous state, the administrator will have to manually confirm that
they want to do so. This is done with the action ‘Override failed testing’.

Once the failed result has been overridden, MAAS will show a warning and continue
to display the component that failed, to warn the user that this machine failed
testing. The resulting state depends on the previous one: a deployed system is
transitioned back to ‘Deployed’, whereas a system that was in ‘New’ or ‘Ready’
state (or any other) is transitioned to ‘Ready’ state, as per the figure below.
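
The override behaviour can be sketched as follows. This is an illustration of the rule stated in this section, not MAAS code; the `OverriddenMachine` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OverriddenMachine:
    """Hypothetical view of a machine after 'Override failed testing'."""
    state: str             # state the machine is transitioned to
    failed_component: str  # still displayed as a warning after the override


def override_failed_testing(previous_state: str, failed_component: str) -> OverriddenMachine:
    """Deployed machines go back to 'Deployed'; all others become 'Ready'."""
    new_state = "Deployed" if previous_state == "Deployed" else "Ready"
    return OverriddenMachine(state=new_state, failed_component=failed_component)


print(override_failed_testing("Deployed", "CPU"))
print(override_failed_testing("New", "storage"))
```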

Conclusion
Hardware testing is a critical step in the infrastructure’s hardware life cycle, and
organisations have to ensure this is a priority for the continued success of their
operations. By testing early, organisations ensure their infrastructure can meet
their application demands, reducing the risk of failures in production and at
peak times that could put their whole infrastructure at risk.

MAAS addresses operational efficiency not only by deploying applications to
production, but also by providing a flexible hardware testing framework that allows
administrators to seamlessly test their hardware (with MAAS-provided tests)
before putting it into production. Furthermore, this framework allows
administrators to integrate any desired custom test into their workflow.

By combining these capabilities and making them part of the hardware on-boarding
process, MAAS ensures that hardware is ready for production.

For more information about creating testing scenarios and
improving infrastructure reliability, please visit:

https://docs.maas.io/2.5/en/nodes-scripts
https://maas.io

© Canonical Limited 2019. Ubuntu, Kubuntu, Canonical and their associated logos are the registered trademarks
of Canonical Ltd. All other trademarks are the properties of their respective owners. Any information referred
to in this document may change without notice and Canonical will not be held responsible for any such changes.
Canonical Limited, Registered in England and Wales, Company number 110334C Registered Office:
12-14 Finch Road, Douglas, Isle of Man, IM99 1TT VAT Registration: GB 003 2322 47
