Title: "Self-Organising, Self-Managing Heterogeneous Cloud"
ACRONYM: CloudLightning
CALL: ICT-07-2014: Advanced Cloud Infrastructures and Services



Currently, cloud infrastructures are mostly homogeneous -- composed of a large number
of machines of the same type -- centrally managed and made available to the end user
using the three standard delivery models: Infrastructure-as-a-service (IaaS), Platform-as-
a-Service (PaaS) and Software-as-a-Service (SaaS). As clouds increase in size and as
machines of different types are added to the infrastructure in order to maximise
performance and power efficiency, heterogeneous clouds are being created. However,
exploiting different architectures, such as graphics processing units, many integrated cores
and data flow engines, poses significant challenges. Accessing heterogeneous resources
efficiently while, at the same time, exploiting them to reduce application development
effort, to make optimisations easier and to simplify service deployment requires a
re-evaluation of our approach to service delivery. The evolving complexity of
the cloud ecosystem will eventually render traditional cloud management
techniques ineffectual. Self-organisation and self-management are powerful techniques
for managing complexity.
Some of our initial simulation results using self-organisation and self-optimisation
indicate that these techniques can be used as the basis of a new cloud management and
delivery model capable of efficiently dealing with issues that arise at scale. Our
preliminary work has centred on promoting access to power-efficient heterogeneous
resources by shifting the deployment and optimisation effort from the consumer to the
software stack running
on the cloud infrastructure.
With CloudLightning, we propose to extend this work and to build a cloud management
and delivery infrastructure based on these principles. Given the prohibitive expense
associated with empirical experimentation on hyperscale cloud infrastructures, data
gathered on our testbed will be used to simulate this infrastructure and to evaluate the
self-organisation approach in that context.

We propose to create a new way of provisioning heterogeneous cloud resources to deliver
services, specified by the user, using a bespoke service description language. Due to the
evolving complexity of modern heterogeneous clouds, we propose to build our system
based on principles of self-management and self-organisation. Service descriptions,
provided by prospective cloud consumers, will result in the cloud evolving to deliver the
required services. The self-organising behaviour built into, and exhibited by, the cloud
infrastructure will result in the formation of a number of potential resource coalitions
capable of meeting the service needs. These coalitions will typically be composed of
heterogeneous components and thus the quality of service that each could deliver will
differ. The user will choose from these offerings and the successful coalition will be
commissioned to deliver the service.
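
The offer-and-choose step described above can be illustrated with a minimal sketch. It is not the self-organising mechanism itself: a centralised, brute-force enumeration stands in for coalition formation, purely to show how several heterogeneous coalitions might satisfy the same request with different qualities of service. The names Resource, ServiceRequest, candidate_coalitions and total_power, and all figures, are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str      # e.g. "CPU", "GPU", "MIC", "DFE"
    gflops: float  # hypothetical throughput figure
    watts: float   # hypothetical power draw


@dataclass(frozen=True)
class ServiceRequest:
    min_gflops: float   # hypothetical stand-in for a service description
    max_resources: int  # largest coalition size considered


def candidate_coalitions(pool, request):
    """Return every group of resources whose combined throughput meets the request."""
    offers = []
    for size in range(1, request.max_resources + 1):
        for group in combinations(pool, size):
            if sum(r.gflops for r in group) >= request.min_gflops:
                offers.append(group)
    return offers


def total_power(coalition):
    """Combined power draw of a coalition, one possible quality-of-service measure."""
    return sum(r.watts for r in coalition)


# Illustrative figures only; they are not measurements from any testbed.
pool = [
    Resource("cpu-0", "CPU", gflops=100.0, watts=200.0),
    Resource("gpu-0", "GPU", gflops=500.0, watts=250.0),
    Resource("dfe-0", "DFE", gflops=400.0, watts=100.0),
]
request = ServiceRequest(min_gflops=450.0, max_resources=2)

offers = candidate_coalitions(pool, request)
# The consumer picks one offer; here the criterion is lowest power draw.
chosen = min(offers, key=total_power)
print([r.name for r in chosen])
```

In the proposed system the offers would emerge from the local decisions of the resources rather than from a global search, and the consumer could weigh criteria other than power.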
An important objective in creating this system is to remove the burden of low-level
service provisioning, optimisation and orchestration from the cloud consumer and to vest
these responsibilities in the collective response of the individual resource elements
comprising the cloud infrastructure. A related objective is to locate decisions pertaining
to resource usage with the individual resource components, where optimal decisions can
be made. Currently, successful service delivery
relies heavily on the over-provisioning of resources. Our goal is to address this inefficient
use of resources and consequently to deliver savings to the cloud provider and the cloud
consumer in terms of reduced power consumption and improved service delivery, with
hyperscale systems particularly in mind.
The main contributions of these investigations include:
 • Recasting the HPC use case applications as container applications.
 • The design and realisation of a Service Oriented Architecture for a heterogeneous
cloud with separation of concerns.
 • A self-organisation solution addressing the problems of centralised resource
management at scale, and the inevitable added complexity resulting from supporting
heterogeneity in the cloud.
 • A description of the effects of self-organisation and self-management strategies on the
surrounding environments and how the objectives of these strategies are affected by
feedback emanating from those environments.
 • A demonstration of how the architecture reconfigures, through the process of
self-organisation, to achieve dynamic stasis in balancing the tendency towards a global
goal with the physical constraints captured by the specific characteristic functions.
 • A mechanism to abstract the underlying resources, and combine resources into
coalitions to support task level parallelism.
 • A number of strategies for coalition formation.
 • A design for a uniform mechanism for attaching heterogeneous resources to the
system, supporting both directly managed and indirectly managed resources.
 • An integrated Telemetry and Monitoring system for CloudLightning, providing a
versatile and pluggable architecture to add custom plugins for special hardware types
such as MIC, GPU and DFE, leveraging the hardware vendor tools and libraries.
 • Methodologies and tools for extracting the metrics from the heterogeneous resources
using built-in tools and interfaces for integrating with the CloudLightning Telemetry and
Monitoring System.
 • A service description language (SDL) for describing CloudLightning application
Blueprints, including services constituting HPC workloads, and a framework for
supporting those applications, while maintaining the separation of concerns between
application lifecycle management and resource management.
 • A description language for specifying HPC workloads and for capturing general
application Blueprints.
 • A broadening of the definition of resource type to include software components,
specifically resource managers managing subsystems (including HPC environments).
 • A Plug & Play architecture incorporating dynamic Resource Grouping strategies
capable of affecting the internal evolution of the self-organising, self-managing (SOSM)
system.
 • A design and implementation of low-level, hardware-interfacing telemetry plugins
for gathering metrics from the heterogeneous resource fabric.
 • A design and implementation of a Telemetry Interface, which decouples the
CloudLightning system from commodity/bespoke Telemetry Servers and thus ensures
that the CloudLightning system is independent of any telemetry framework that might
be associated with new and emerging hardware.
 • A design and implementation of SOSM plugins for negotiating the registration and
deregistration of resources with the SOSM system.
 • The construction of a bespoke CloudLightning Simulator.
 • A series of experiments to examine the parallel performance and scalability of the
simulation platform.
 • In-depth experiments on the performance of the CloudLightning system and
comparisons to traditional cloud systems.
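
The pluggable telemetry architecture listed among the contributions could take a shape along the following lines. This is a minimal sketch under stated assumptions, not the CloudLightning implementation: the class names TelemetryPlugin, PluginRegistry, StubGpuPlugin and StubDfePlugin, the metric names and the zero-valued readings are all hypothetical. It shows only the decoupling idea, in which the core system reads metrics through one uniform interface and hardware-specific plugins can be registered or deregistered independently.

```python
from abc import ABC, abstractmethod


class TelemetryPlugin(ABC):
    """Hypothetical base class: one plugin per hardware type."""

    hardware_type = "generic"

    @abstractmethod
    def collect(self):
        """Return a dict mapping metric names to numeric values."""


class PluginRegistry:
    """Hypothetical registry through which the core system reads all metrics."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        # A later registration for the same hardware type replaces the earlier one.
        self._plugins[plugin.hardware_type] = plugin

    def deregister(self, hardware_type):
        # Removing an unknown hardware type is a no-op.
        self._plugins.pop(hardware_type, None)

    def snapshot(self):
        """Collect one reading from every registered plugin."""
        return {hw: plugin.collect() for hw, plugin in self._plugins.items()}


class StubGpuPlugin(TelemetryPlugin):
    hardware_type = "GPU"

    def collect(self):
        # A real plugin would call the hardware vendor's tools or libraries here.
        return {"utilisation_pct": 0.0, "power_watts": 0.0}


class StubDfePlugin(TelemetryPlugin):
    hardware_type = "DFE"

    def collect(self):
        # Placeholder values only; no real device is queried.
        return {"utilisation_pct": 0.0, "power_watts": 0.0}


registry = PluginRegistry()
registry.register(StubGpuPlugin())
registry.register(StubDfePlugin())
readings = registry.snapshot()   # metrics keyed by hardware type
registry.deregister("DFE")       # the core system is unaffected by removal
```

Because callers depend only on the collect method, support for a new hardware type would be added by writing one more plugin class rather than by changing the code that consumes the metrics.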