
The Design and Architecture of AMPS: An Asynchronous Middleware for Protocol Servers

Farhan Zaidi, Zakaullah Kiani
Advanced IMS Inc., Ontario, Canada

Abstract


We present the architecture and design of a high-performance application development framework. We call this platform the Asynchronous Middleware for Protocol Servers (AMPS). The design of AMPS has been motivated by the need to implement highly concurrent servers for application-level protocols in the telecommunication and networking domains. The key idea is to reduce the development time of an application server by providing programming interfaces for tasks common to most, if not all, non-trivial servers. AMPS is primarily based on an asynchronous, event-driven programming model. It uses the non-blocking I/O interfaces commonly available in operating systems. It uses multi-threading only when blocking I/O becomes unavoidable, e.g. due to the unavailability of non-blocking interfaces in the underlying operating system for a particular type of I/O operation, or when the application needs to take advantage of multiple CPUs. AMPS is especially suited for implementing protocol servers such as SIP, H.323 and Diameter. We have implemented a SIP registrar and proxy server on top of AMPS. This SIP server is able to handle several thousand concurrent SIP sessions and registration requests (well above 50,000) arriving at the server at close to 1000 session establishment requests per second, without any session experiencing a timeout for any SIP message. This performance has been achieved on common desktop machines with a 3.0 GHz Pentium IV processor and 1 GB of RAM. The implementation of the AMPS Application Programming Interface (API) is available as open source under the GNU General Public License. This paper describes the architecture and design of AMPS, why different design decisions were made, and how different mechanisms are implemented in the platform. The design of the SIP server will be described in a separate document. We believe that AMPS can also be used to implement web servers and other types of application servers. AMPS is portable across operating systems, as it has been designed from the ground up with portability in mind. We have released AMPS for the Linux 2.6 kernel. A Windows version will also be released soon.

Introduction
Implementing commercial-grade server software for a telecommunication or networking application protocol is a highly complex and time-consuming task. There are, however, a large number of aspects that, from the programmer's standpoint, are common to all protocol server software. Examples of such tasks are memory and buffer management, timer management, dealing with file I/O, interacting with databases, performing overload and admission control, caching different objects of information in memory, and replicating or load balancing multiple server instances across different machines to provide fault tolerance, reliability and scalability. Taking this commonality into consideration, we have designed and implemented a server software development framework called AMPS that abstracts server applications from several of the tasks required during development. Using this framework, an application developer may concentrate on the specifics of the protocol rather than the implementation of the tasks mentioned above. She would concentrate on the parsing and creation of protocol messages and on the finite state machine of the protocol. All other complexity would be abstracted inside the framework.

While designing this framework, the most fundamental question we came across was the choice of the programming model. This choice is the basic architectural decision that has a huge impact on server performance, scalability and throughput. Performance bottlenecks resulting from a particular programming model cannot generally be corrected by code optimizations in later stages of the software's life cycle. The following section describes the AMPS programming model and the design alternatives that were considered, and explains the rationale behind the final choice.

A comparison of common programming models


The common programming models for server development in use today are multi-threaded, event-driven, or a hybrid of the two. There are other models as well, e.g. the multi-process or multiple-address-space model, but we will restrict our discussion to the more common ones for brevity.

Multi-threaded model
The most commonly used model for server software is the multi-threaded model. In this model, an application is modeled as a collection of threads. Thread-based programs use multiple threads of control within a single address space. Threaded programs achieve concurrency by suspending a thread blocked on I/O and resuming execution in a different thread. In this model, the programmer must carefully protect shared data structures with locks or mutexes. Furthermore, there is frequent need to coordinate and synchronize the progress of different threads with respect to each other. Condition variables, semaphores and other similar primitives are used for these purposes. In multi-threaded designs, each thread is usually part of a thread pool and is responsible for one single session of the protocol being implemented. This one-thread-per-session model is the most common in use today in commercial and non-commercial servers. Some variations on this model try to bound the number of threads created as sessions grow in number, by mapping N sessions to M threads where M < N. Such models usually introduce some form of queuing to cater for overload conditions. Context information of protocol sessions is mostly kept on the stack of the thread handling the session. As mentioned before, global or system-wide data structures are protected against race conditions via mutexes and other similar primitives.

Drawbacks of multi-threaded model
Several researchers have shown that the multi-threaded model has serious drawbacks and weaknesses when it comes to achieving high performance and ease of use. The main drawbacks are as follows:

a) The need for synchronization between threads, and the need to take care of race conditions and provide mutual exclusion when accessing shared data structures. This invariably results in subtle bugs that may lead to deadlocks, non-obvious race conditions and complex code. It is generally believed that an average programmer understands the threaded paradigm better and is thus more comfortable with it. However, research [1] has shown that this paradigm, due to its inherently non-deterministic nature, leads programmers to make mistakes almost invariably when using synchronization and mutual exclusion primitives. It has been found to be close to impossible for the average programmers available in the market today to write correctly synchronized concurrent code.

b) Extensive use of multi-threading also leads to performance bottlenecks in conditions of high load. As the threads grow in number, synchronization and mutual exclusion overheads become significant beyond a certain point. This results in inefficiency due to unnecessary blocking. By unnecessary we mean blocking caused by synchronization primitives, for example when hundreds of threads are blocked on semaphore queues, rather than blocking while waiting for some I/O event to occur.

c) The creation of a large number of threads to manage concurrency under heavy load results in memory and CPU resource bottlenecks. As threads grow in number, beyond a certain point the context switching and memory requirements in the kernel due to stack management become significant.

Event-driven model
In contrast to threads, pure event-driven programming is a paradigm based on a single thread of control. Event-based programs are organized around the processing of events. Events happen asynchronously, and the program never waits or blocks for any event to complete. Before we go into the details of the model, we need to look at what events really are, especially in the context of the protocol sessions that a telecommunication or networking server has to handle.

Event: An event is any stimulus that causes a change in the state of a system. An event-driven system can therefore be easily modeled as a Finite State Machine (FSM). The whole application acts as an FSM that moves from state to state based on the stimuli provided to it in the form of events. From the perspective of protocol servers, examples of events are:
- Arriving network messages
- I/O operations reaching completion
- Firing of timers when their scheduled timeout occurs
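In code, events and their handlers can be pictured roughly as follows. This is an illustrative sketch only; the type names and the callback signature are assumptions, not the AMPS API.

/* Illustrative only: a minimal shape for events and their handlers in an
 * event-driven core. Names are hypothetical, not the AMPS API. */
#include <stddef.h>

typedef enum {
    EV_NET_MSG_ARRIVED,   /* a network message arrived on a descriptor   */
    EV_IO_COMPLETED,      /* a previously started I/O operation finished */
    EV_TIMER_FIRED        /* a scheduled timeout expired                 */
} event_type_t;

typedef struct event {
    event_type_t type;    /* which stimulus occurred                     */
    void        *data;    /* dynamic data, e.g. the received message     */
    size_t       len;     /* length of the dynamic data, if any          */
} event_t;

/* An event handler (callback): 'ctx' is state registered by the listener,
 * 'ev' carries the data generated when the event occurred. */
typedef void (*event_handler_t)(void *ctx, const event_t *ev);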

To handle each event type, an event-driven program is written as a set of event handler functions. These functions are also called callbacks. There is a single main thread of control in the program that usually waits in an infinite loop for any events to occur. Let's call this main loop the controller. When one or more events occur, the controller wakes up from its waiting state and calls the appropriate event handler for each event serially. It is therefore necessary that the controller knows about the event handler for each event. This can be accomplished by a registration mechanism where a software entity or module registers an event handler for a particular event with the controller. Each event handler is passed event-related data as an argument. For example, event handlers may need protocol session state and other session-related information to process the events. They may also need the dynamic data generated due to the occurrence of events, e.g. the network message that arrived and caused a particular event. Two requirements need to be met in order to pass relevant data to event handlers:

1. The software entity that registered the event handler should be able to pass event-specific data as part of the registration if desired. This event-specific data would act as context or state information for the entity performing the registration.

2. The software entity that generates or notifies an event should be able to pass some data as arguments to the event handler.

Note that some event handlers may not require any data; some may require only dynamic data, while most would require dynamic data as well as context or state information stored as part of the registration. For example, a network receiver event handler, invoked to receive a network message on a network connection, may not need any state information, since its job may just be to copy bytes from the network and determine the next event to generate. After receiving the message, the network receiver may parse the network message to determine the next event type to be generated, and then generate that event, passing the network message as the event-specific data. The user of this next-level event may have registered some state information while registering its event handler. Both the network message and the state information would be passed as arguments to the next event handler. The next-level handler could continue processing by performing the computation or I/O required in the current session state, updating various session variables, transitioning the FSM to the next state, and possibly generating yet another event. After one event handler returns, the controller calls the next event handler, and so on until there are no more events to be processed. The controller then goes back to its waiting state.

Non-blocking operation
In order to perform optimally, it is imperative for an event-driven system that an event handler never blocks waiting for any activity. If an event handler needs to perform an I/O activity, e.g. to construct and send a message to the network, it would usually invoke a system call of the underlying operating system. If it blocks during this system call, then all other event handlers that were to be called after the blocked handler would be delayed. In a busy system there could be hundreds of events occurring concurrently. Any blocking would make the server's response time increase linearly with the number of concurrent events. If, however, the event handlers could somehow guarantee non-blocking operation, then the system would give close to optimal performance. This is because no event handler would ever incur any overhead other than performing its required processing on the CPU. The controller would process events basically in First-In-First-Out (FIFO) order. In fact, other scheduling disciplines could be deployed, as we will see later, but FIFO is a good assumption that keeps the current discussion simple. Whenever an event handler performs an I/O operation, it must find out whether the operation can be completed without blocking. If the operation would block for any reason, it must register another event handler and return. This new event handler is tied to the condition when the desired operation can be performed without blocking. The returning handler may also store some state or context information with the newly registered handler. When the desired condition is met, i.e. the operation can be completed without blocking, the new handler is invoked and passed the stored context so that it can continue its operation. This essentially means that the programmer may have to divide the code into several event handlers, each executing indivisibly until it hits a blocking operation. Note that the same thing occurs in the threaded model, where a thread may block while performing I/O.

Its context information is on its own stack, and the operating system's thread scheduler gives the CPU to another thread. However, there are two main differences:

1. In event-driven systems, the context switch happens voluntarily, as each event handler cooperatively relinquishes control back to the controller when it hits a blocking operation. In the threaded model, the operating system has to perform the context switch whenever a thread blocks. The result is that event-driven programs incur the overhead of a return from a function call instead of an OS-level context switch. A return from a function call is several times cheaper than a context switch, even for light-weight processes such as threads.

2. The event handlers are always called serially. Therefore, access to global data structures, e.g. session tables, is always serialized. There is no need to protect any data structures via mutexes. Furthermore, each event handler operates independently on state information. Any two event handlers operating serially on the same state information can synchronize their progress, if required, via the state information object itself. There is again no need to synchronize via condition variables or semaphores.

Drawbacks of event-driven programming
The major drawbacks of event-driven programming are as follows:

1. The often-cited major drawback of event-driven programming is the complexity of code that results from the use of this model. Each event has to carry the context or the associated data on which the handler has to operate. The programmer has to manage a potentially large number of contexts, possibly look them up inside the event handler to determine which context or protocol session the event belongs to, and then operate on that context. Furthermore, the programmer has to be aware of the operations that may result in blocking. The code must be divided into indivisible chunks called event handlers, with one chunk scheduling the next before returning to the controller. This increases complexity for the programmer. However, we believe that this complexity is quite manageable through proper abstractions and interfaces. Much of the detail can be hidden from the programmer by designing powerful abstractions. This is exactly what we have achieved in AMPS. AMPS provides an Application Programming Interface (API) for performing asynchronous I/O on top of the non-blocking I/O interfaces widely available in common operating systems.

2. Another drawback of this model is that one cannot make good use of multiple CPUs if they are available in the underlying hardware platform. This is because only one thread of control implements the controller. The performance of multi-threaded programs may be enhanced transparently, since the OS would usually take care of scheduling different threads on multiple CPUs. Techniques to make use of multiple CPUs are discussed in a later section of this document.

3. One of the major limitations of event-driven servers until recently has been the limited scalability of the event notification facilities available in common operating systems. For event-driven systems to work, we need an event notification facility provided by the operating system and, ideally, also asynchronous I/O interfaces. Asynchronous I/O interfaces are not widely available in commercial operating systems today.

Some asynchronous I/O libraries are available on Linux and UNIX flavors, e.g. the aio_xxx calls, but their implementations internally use threads. The event notification facilities like select() and poll() have major performance issues: their performance degrades drastically beyond a relatively small number of file descriptors. However, some event notification facilities recently added to operating systems, e.g. epoll() in Linux, /dev/poll in Solaris and I/O completion ports on Windows, scale an order of magnitude better than select() and poll(). The AMPS Linux implementation uses epoll() for event notification; the Windows version uses I/O completion ports.

Scheduling of events
In our opinion, the greatest advantage of the event-driven model is that it enables an application to perform its own application-level scheduling. In the threaded model, threads are usually scheduled by the operating system. The application has little control over scheduling. Operating system schedulers are mostly generic in nature and may not provide optimal scheduling for each application. The application knows a priori what it needs in terms of response time, throughput and its breakdown of tasks. Therefore, giving full control of scheduling to the application would most likely result in better performance than letting the OS take care of it. In the event-driven model, the need for scheduling arises when multiple concurrent events occur. The controller wakes up and finds that there are multiple events waiting. Now the controller could invoke the services of a scheduler function, asking it to output an ordering in which the events are to be processed. The scheduler could deploy any algorithm or heuristic that suits the application. For example, it could assign priorities to events and output a priority list. It could use FIFO order, round-robin, etc. like conventional schedulers, or it could use more advanced techniques like group or batch scheduling of events of the same or similar types to maximize instruction and data cache hit ratios. As more experience is gained with the application, especially under heavy load and during busy hours, the scheduler may be improved in isolation without modifying other parts of the application.
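To make the shape of such a controller concrete, the following is a minimal sketch of an epoll-based event loop on Linux. The dispatch_fd_event() function is a hypothetical placeholder for the code that turns ready descriptors into application-level events; it is not part of the AMPS API.

/* Minimal sketch of an epoll-based controller loop (Linux).
 * dispatch_fd_event() is a hypothetical placeholder for the code that
 * maps a ready descriptor to an application-level event. */
#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_READY 64

extern void dispatch_fd_event(int fd, unsigned int ready_mask); /* hypothetical */

void controller_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); exit(1); }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
        perror("epoll_ctl"); exit(1);
    }

    for (;;) {
        struct epoll_event ready[MAX_READY];
        int n = epoll_wait(epfd, ready, MAX_READY, -1); /* block until events */
        for (int i = 0; i < n; i++) {
            /* Each ready descriptor is turned into one or more internal
             * events; handlers are then called serially, never blocking. */
            dispatch_fd_event(ready[i].data.fd, ready[i].events);
        }
        /* At this point the event scheduler would drain the internal
         * event queues before going back to epoll_wait(). */
    }
}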

The design goals of AMPS


AMPS is based on the event-driven model, while still taking advantage of the threaded model in certain cases. The key design goals of AMPS can be listed as follows:
- The architecture must not use any locks, mutexes, semaphores, condition variables, etc. whatsoever.
- The architecture should be completely modular and extensible, such that the addition of new features and modules is a seamless and relatively easy exercise.
- The architecture should provide APIs for memory and buffer management, timer management, management of transport-level connections, sending and receiving of messages over the network, caching of objects to increase performance, protection against overload conditions, interfacing with databases, file I/O for event logging and tracing, taking advantage of multiple CPUs, and, most importantly, management and scheduling of events.
- The architecture must be able to perform adequately on common off-the-shelf hardware.
- The architecture should be scalable, such that the only real bottlenecks at the architecture level are the CPU speed and the amount of main memory available in the system. Moving to a faster CPU and more main memory should increase the performance of the application proportionately.

The following sections describe the design of AMPS in detail, how applications are built and organized, how events are managed and scheduled, and how different components of the architecture have been implemented.

Motivation and relation of AMPS to similar research systems


The AMPS design has been motivated and inspired by several research projects. Most notable among these are an article by Ian Barile that appeared in Dr. Dobb's Journal [3], libasynch [4], SEDA [5], Ninja [6] and the Flash web server [7], to name a few.

The anatomy of an application that uses AMPS


The core of an application built on top of AMPS is modeled as a single, non-blocking, event-driven process. The salient features of the application are listed below:
- The application is divided into a set of modules. Each module is further divided into a set of functions called event handlers. Each event handler is created to handle a particular event.
- Events can be generated as a result of the occurrence of I/O operations, or as a result of one module trying to communicate with another module.
- The communication model between modules used by AMPS is based on a rendezvous abstraction. In this model, senders are de-coupled from receivers. Senders do not know whom they are sending an event to, and receivers do not know who sent them an event. Receivers register their interest in the events they want to receive. Senders just generate events at the appropriate time, hoping that each event will be received by all modules interested in that particular event.

This type of communication model makes seamless integration of new modules quite easy. The whole system becomes a service-oriented architecture. Each module provides a set of services, and requires a set of services from other modules. A module could either provide a service when solicited or requested, or it could provide a service asynchronously. A service itself could be some task or function required by one or more modules in the system, or it may only be some information disseminated throughout the system and picked up by anyone interested in it. Note that at the architecture level, the producers and consumers of a service do not need to know about each other. However, if they want to find out about each other at the application level, they may use the data transferred in each event for this purpose. This keeps the architecture clean and maintains its simplicity.

Event types in AMPS
An event has an event type, and may contain associated data to be passed to the receiver. For example, most events, when generated, may pass context information to determine which session or connection the event belongs to. The following types of events can be generated in the AMPS model:

Request events
These events represent a request for a particular type of service from another module, directed towards any module that can provide the service. The requestor does not know, at the time of generating the request event, which module is going to serve the request. It just blindly generates the request event, hoping that there is a module in the system that can provide the response. The requestor, however, registers its interest in the response event corresponding to the request.

Response events
These events represent a response to an earlier generated request. They are always generated in response to a request, by modules that registered their interest in a particular request event. Of course, a module would most likely register its interest in a request event only if it is able to process the request and provide a response. When such a module receives the request (because it registered interest in it), it generates the response event after request processing is complete.

Notification events
These events represent asynchronous notifications. They may be generated by modules that want to provide some information or a service for which no request is necessary. In other words, the information or the provided service is unsolicited. If it is of importance to any other module, that module would register its interest in it and thus receive the notification event.

Event Management in AMPS
The following entities are associated with event management in this model:

Listeners
These are modules that want to receive an event whenever it is generated. They register their interest in a particular event using the AMPS API. They may register their interest in request, response or notification events. When a module registers for an event as a listener, it must provide a callback function. This function is called when the event is delivered to the module.

Notifiers
These are modules that want to send request, response or asynchronous notification events. They generate request, response or notification events, which in turn are propagated to listeners by the AMPS event management system.

Event Manager
This is the central component of the event management system. It manages registrations coming from listeners and notifications from notifiers. When an event is generated (notified), the event manager arranges for the delivery of the event to listeners. If multiple events are generated concurrently, the event manager invokes the event scheduler to deliver them to listeners in a certain order.

When the application designer wants to add a new module to the system, he/she needs to perform the following steps:
a) Determine the services this module would provide to the rest of the system.

b) Categorize the services provided by the module into response events (the events the module would generate in response to requests) and asynchronous notifications that the module would generate to announce some information to the rest of the system.
c) Determine the services this module would need from other modules.
d) Categorize the services this module would need from other modules into requests the module would make to the rest of the system, and asynchronous notifications it would need to process if supplied by any other module in the system.
e) Add new request, response and notification events to the application, if required as a result of step (b).
f) Write an initialization function that:
- registers the module as a listener for the request events determined in step (b) above, so that the request events it serves can be routed to it;
- registers the module as a listener for responses to the requests generated by the module itself, determined in step (d) above, so that the responses to its own requests can be routed to it;
- registers the module for the asynchronous notifications it wants to process, determined in step (d) above, so that those notifications can be routed to it.
g) Write event handlers for all the events the module has registered for as a listener.
h) Write event handlers in other modules that need the services provided by this module, as determined in step (b) above.

Event registrations can be done either in the initialization functions or from inside event handlers, since the framework provides an API for event registration that can be called from any function. There is no restriction that it be called from the initialization function only. The following figure illustrates the communication paths between different modules.

In figure 1 (a), module 1 generates a request event R1. Module 2 is registered for the request R1. The event scheduler delivers the event to module 2 (meaning that it calls the event handler registered by module 2 for R1). Module 2 processes the request and at some point generates a response event RES1. Module 1 is registered for RES1, so the event scheduler delivers it to module 1. In figure 1 (b), module 1 wants to disseminate some information to all modules interested in such information. It generates a notification event N1 that carries the information it wants to disseminate. Modules 2 and 3 are both registered for N1. The event scheduler delivers N1 to both module 2 and module 3.
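As an illustration of the interaction in figure 1 (a), the fragment below sketches how two modules might register for and exchange request/response events. The amps_register_listener() and amps_notify_event() calls are hypothetical stand-ins for the AMPS registration and notification APIs, which are not reproduced here.

/* Hypothetical sketch of the figure 1 (a) interaction. The amps_* calls
 * are illustrative placeholders, not the actual AMPS API. */
typedef enum { EV_R1, EV_RES1 } ev_type_t;

typedef void (*ev_handler_t)(void *listener_ctx, void *event_data);

/* Assumed to exist in the framework: register a callback (with listener
 * context) for an event type, and generate an event carrying data. */
extern void amps_register_listener(ev_type_t type, ev_handler_t h, void *ctx);
extern void amps_notify_event(ev_type_t type, void *event_data);

/* Module 2: serves R1 and produces RES1. */
static void module2_on_r1(void *ctx, void *data)
{
    /* ... process the request carried in 'data' ... */
    amps_notify_event(EV_RES1, data);      /* respond at some later point */
}

/* Module 1: consumes RES1. */
static void module1_on_res1(void *ctx, void *data)
{
    /* ... continue the session using the response in 'data' ... */
}

static void module1_init(void) { amps_register_listener(EV_RES1, module1_on_res1, 0); }
static void module2_init(void) { amps_register_listener(EV_R1,  module2_on_r1,  0); }

/* Somewhere in module 1's processing, the request is generated:          */
/*     amps_notify_event(EV_R1, request_data);                            */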

Module types in AMPS


A module in AMPS is categorized as one of two types: non-blocking or blocking.

Non-blocking modules
These modules perform CPU-bound operations and possibly I/O operations, but those I/O operations can be performed without blocking. AMPS provides non-blocking interfaces for almost all I/O operations for which the underlying operating system provides a non-blocking mode of operation. For example, on Linux and Windows, it provides APIs for sending and receiving network messages and for reading and writing files for event logging and accounting. For such non-blocking modules, the steps described above are sufficient for integrating the module with the rest of the application.

Blocking modules and I/O agents
There may be instances where blocking becomes unavoidable. Examples include resolution of Domain Name System (DNS) queries (since most libraries available with common operating systems for this purpose use a blocking mode of operation) and interfacing with databases. In both these cases, the underlying OS or library may provide non-blocking interfaces in addition to blocking ones, e.g. Windows provides asynchronous DNS library calls and Oracle provides an asynchronous interface for C. However, commonly used database and DNS interfaces are blocking in nature. When dealing with such modules, AMPS requires the developer to create them as I/O agents. I/O agents are modules that perform blocking I/O and are I/O bound, i.e. they prepare I/O requests and then spend most of their time waiting for the I/O operations to complete. I/O agents are composed of a front-end dispatcher of events and a thread pool. Each thread in the pool has an associated event queue. Threads remain blocked on their event queues when they have no work to do. The front-end dispatcher is registered as the event handler with the main event manager for all events the module has registered for as a listener. The dispatcher selects a thread from the pool and dispatches a received event to the event queue associated with that thread. This design has the advantage that large bursts of I/O requests can be absorbed in the thread pool's event queues. Each thread runs the same code. Once it wakes up as a result of a newly queued event on its queue, it determines the event type and then calls the event handler for the event. The event handler may block while performing an I/O operation. In that case the thread serving the event blocks, but the application continues processing other events undisturbed. The application never blocks when there is useful work to do. Since the threads implementing I/O agents are supposed to be I/O bound, they take very little CPU time and then block on the I/O. They would usually not use up the full time-slice given to them by the operating system. They therefore do not create much disturbance for the progress of the main application. When the results of the I/O operations are available, the threads wake up and generate appropriate events, which in most cases would be response events. A fundamental requirement for I/O agent threads is that they must not share any data with other threads or with the rest of the application.

They must communicate with other modules via events carrying data, using the inter-thread communication mechanisms provided by the OS (of course, AMPS provides APIs for this purpose). It is assumed that I/O agents perform dedicated I/O-related tasks, such as formatting a database query and sending it over to the database, then processing the response and sending it back to the application via the event system. We do not foresee any requirement for these threads to share data or to synchronize with other threads or with the rest of the application. The designer of an I/O agent has to take all the steps from (a) to (g) given in the previous section. In addition, he/she has to determine how many threads should be created in the thread pool. This may depend on how much load is expected on the I/O agent in question. For example, an I/O agent involved in each protocol session and in the processing path of most of the messages coming from the network, e.g. a DNS resolver, may require several threads to achieve maximum concurrency under heavy load, whereas another agent whose interaction with the rest of the system is infrequent may require a single thread. As for the dispatcher function, the designer may use the dispatcher supplied by AMPS to distribute requests between threads, or write his/her own dispatcher for this purpose. The sketch below outlines the I/O agent pattern in code; figure 2 then illustrates the flow of information between a module requesting a service and an I/O agent.
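The following is a minimal sketch of the I/O agent pattern under several assumptions: the dispatcher hands events to worker threads over per-thread pipes (one possible OS-provided inter-thread mechanism), and each worker performs the blocking I/O and then notifies a response event. The amps_notify_event() call, the event structure and the response event type are hypothetical placeholders, not the AMPS implementation, and error handling is omitted.

/* Illustrative I/O agent: a dispatcher plus a small thread pool, with a
 * pipe per worker used as its event queue. Names are hypothetical and
 * error handling is omitted for brevity. */
#include <pthread.h>
#include <unistd.h>

#define N_WORKERS 4
enum { EV_AGENT_RESPONSE = 100 };               /* hypothetical event type */

typedef struct { int type; void *data; } agent_event_t;

extern void perform_blocking_io(agent_event_t *ev);   /* e.g. a DNS query  */
extern void amps_notify_event(int type, void *data);  /* hypothetical API  */

static int worker_pipe[N_WORKERS][2];   /* [0] = read end, [1] = write end */

/* Worker thread: blocks on its pipe, performs the blocking I/O, then
 * generates a response event back into the main event system. */
static void *worker_main(void *arg)
{
    int rd = *(int *)arg;
    agent_event_t *ev;
    while (read(rd, &ev, sizeof ev) == (ssize_t)sizeof ev) {
        perform_blocking_io(ev);       /* may block; only this thread waits */
        amps_notify_event(EV_AGENT_RESPONSE, ev->data);
    }
    return NULL;
}

/* Dispatcher: registered with the event manager as the agent's handler;
 * it simply round-robins incoming events onto the worker pipes. */
void agent_dispatch(agent_event_t *ev)
{
    static int next = 0;
    write(worker_pipe[next][1], &ev, sizeof ev);
    next = (next + 1) % N_WORKERS;
}

void agent_init(void)
{
    for (int i = 0; i < N_WORKERS; i++) {
        pipe(worker_pipe[i]);
        pthread_t t;
        pthread_create(&t, NULL, worker_main, &worker_pipe[i][0]);
    }
}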

In figure 2 we have module 1 requesting a service that is provided by module 2, which is an I/O agent. Module 2 is further decomposed into a thread pool and a dispatcher function. The dispatcher function is registered as a callback with the event manager. The thread pool consists of four threads. The request event R1 generated by module 1 is routed by the event scheduler to module 2 and its dispatcher function is invoked. The dispatcher selects a thread from the thread pool (thread 2 in this example) and passes the request to this thread. The thread performs the required I/O for the request. Once the I/O is complete, it returns the results by generating a response event RES1, which is routed back to module 1 since module 1 registered for this event type.

CPU agents and exploiting multiple CPUs
CPU agents are another type of module that have exactly the same internal design as I/O agents. They are also composed of a dispatcher and a thread pool. However, unlike I/O agents they are completely CPU bound: they perform no I/O whatsoever. They communicate with the main application via inter-thread communication mechanisms just like I/O agents. The reason for creating CPU agents is to exploit parallelism on a hardware platform with multiple CPUs. If the application developer can isolate pieces of code, e.g. procedures or collections of procedures that perform heavy CPU operations on input data, she could create a CPU agent with the number of threads in the pool at least equal to the number of CPUs available. This would let the OS exploit multiple CPUs to schedule the threads, thus enhancing parallelism and, consequently, performance. However, it must be ensured that the isolated procedures do not share any data with the rest of the application. In other words, they must not require modifying session or any other global state. As an example, the SIP parser in the SIP server application, or at least some of its core parts, is a good candidate for conversion into a CPU agent. Another example might be a transcoder or encoder/decoder function in a media server. Threads that perform infrequent synchronization with each other are usually handled well by the multi-processor OS scheduler in terms of processor affinity and exploitation of parallelism.

Application contexts
In event management systems, state pertaining to sessions is kept in context objects. Context objects are usually kept on the stacks of threads in multi-threaded systems. In event-driven systems, however, these context objects must be kept globally and passed to event handlers as parameters. In AMPS, an application usually creates a hierarchy of contexts:

System context: At the top level is the system context, which is the global context of the complete application. System-wide global state is kept in the system context.

Application context: The system context contains a table of application contexts. This means that AMPS can support several different applications running simultaneously and independently of each other. Within the application context is application-level global state, i.e. state information that applies throughout the application.

Session context: The application context also contains a table of session contexts. For any session-oriented protocol, the session context contains state pertaining to a protocol session. The event handlers modify the state in the session context as they implement the protocol session state machine. They also keep session-related data structures for bookkeeping, accounting and any other information they want to store in the session context for correctly performing the protocol functions.
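A rough sketch of this context hierarchy in C might look as follows; the types and fields are illustrative assumptions, not the actual AMPS data structures.

/* Illustrative context hierarchy; types and fields are assumptions. */
struct session_ctx {
    int         fsm_state;       /* current protocol FSM state            */
    void       *proto_data;      /* protocol-specific per-session storage */
    /* bookkeeping, accounting, timers, ... */
};

struct app_ctx {
    void               *app_globals;   /* application-wide state          */
    struct session_ctx *sessions;      /* table of session contexts       */
    unsigned            n_sessions;
};

struct system_ctx {
    void           *system_globals;    /* configuration, statistics, ...  */
    struct app_ctx *apps;              /* table of application contexts   */
    unsigned        n_apps;
};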

Event handlers may also modify the application context in some cases, if there is state that needs to be made globally available across all sessions, e.g. statistics. Some event handlers may modify the system context as well, especially those that serve user-interface-related configuration and provisioning events applicable to the whole system.

Event manager
It should be obvious from the description of the AMPS architecture so far that the event management system is at the heart of the architecture. The event manager consists of an event registration mechanism, event scheduling and notification. Event registration is simple: a list of registered callbacks is maintained per event type. In addition, the event scheduler maintains a queue per event type. Events are generated using an API provided by AMPS. The API function puts the generated event in its respective queue; it does not actually call the handlers immediately. The event scheduler calls the registered handlers according to the scheduling policy.

Main application
The application built on top of AMPS performs the following functions. It performs the initialization required to set up all the I/O and computation modules required by the application. This usually involves creating all configured modules, including both non-blocking modules and I/O agents. It also sets up network transport descriptors, e.g. sockets, and file I/O descriptors. AMPS internally sets the descriptors to non-blocking mode and adds them to the operating system's polling facility so that external I/O events arriving on them can be detected. As part of this initialization, the application also registers an I/O receive function with each descriptor to perform initial application-level processing on the received data. AMPS provides the necessary APIs for performing these tasks. After setting up the I/O descriptors and creating all the modules, the main application calls an API that polls the operating system for any pending external events on any of the descriptors. If there is no external event pending, the user may set up the application to wait indefinitely, wait until a timeout occurs, or return immediately. When one or more events occur, the application wakes up and enters a processing loop. Inside the loop, it retrieves each event from the operating system, again using an AMPS API. The API retrieves each event and internally calls the network receiver processing function registered during initialization for the descriptor on which the event occurred. Note that the application-level event type is not yet known when a network or other I/O event arrives at a descriptor. It is the job of the initial processing function to determine the application-level event type for each I/O event arriving on the I/O descriptors. Once the application-level event type is known, that event is generated. The event generation mechanism puts the new event into the queue for that event type. This way, arriving I/O events are processed by the main application and the corresponding application-level events are collected in event queues. Once all external I/O events have been retrieved and their corresponding internal events generated, the main application invokes the event scheduler. In our Linux implementation, we use the new epoll event notification system call in AMPS instead of the conventional select or poll APIs.

epoll is far more scalable than select and poll. We have found it to scale to several thousand network events arriving at a significant rate (higher than 1000 per second) for several hours. In our opinion, therefore, this is the most scalable event notification facility available on Linux today.

Event Scheduling
The event manager calls the event scheduler to actually deliver events in a certain order. The scheduler goes through each event queue and calls the registered callbacks or event handlers. It may happen that the event handlers generate further internal events during their processing. Those new events are put in their respective queues. The scheduler could serve the queues in many possible orderings. For example, it may make multiple passes through all queues repeatedly until it finds no event in any of the queues. Another possibility is that it knows which event queues hold events generated as a result of I/O events arriving from the outside (let's call them external events), and which hold events generated by event handlers internally during their execution (let's call them internal events). The scheduler may serve one external event queue until it is empty, then all internal event queues to quickly get the internal events out of the system, then move on to the next external event queue, and so on. How the scheduler works depends upon the scheduling policy. AMPS provides a default scheduler that classifies queues into external and internal categories and schedules between them as just described. This also means that events collected in each queue are delivered as a batch. This policy may provide good instruction cache behavior for events whose list of registered event handlers has only one entry. This case is quite common for request and response type events. This policy may also result in good data cache behavior if one or more event handlers registered for a particular event access the same data structures. This type of good caching behavior is virtually impossible to achieve via thread scheduling done by the OS. However, this policy may result in increased latency for other events if a large number of events of a particular type arrive simultaneously. To circumvent this, the default scheduler in AMPS uses the scheduling policy called Deficit Round Robin (DRR). In DRR, each external event queue is assigned a quantum q, which in our case is currently specified as a number of events. In addition, each external queue has an associated credit c with a maximum limit l. The scheduler algorithm is as follows: the scheduler serves q events from each external queue in each round. After each external queue, it serves all internal queues. If a queue has fewer than q events, the difference is added to its credit c, provided c is less than l. If there are more than q events, then at most c additional events are served beyond the quantum. The credit c thus serves as the burst size for a particular external event queue in conditions of high load. If c reaches its maximum limit l, then any additional events arriving for that external queue are dropped. There is no limit or credit for the internal event queues. There is no possibility of starvation, since all queues get their chance eventually. The user can modify the algorithm as desired. The event scheduler provided with AMPS serves as an example only.

The user could serve events in order of priority, do simple FIFO scheduling, or just do simple round robin if caching behavior is less important than round-robin properties. As pointed out earlier, complete control over the scheduling of events in AMPS has several advantages over designs that deploy the multi-threaded model and let the operating system perform the scheduling. The application knows best what behavior is desirable for its purpose. Once events are collected in queues, new and novel algorithms can be applied to the queues and experimented with to improve different performance metrics of the application. A sketch of the default scheduling policy is given below; figure 3 then illustrates the flow of information between different components of the application.
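The following sketch illustrates one round of the default DRR-style policy described above, under the assumption that the accumulated credit is spent as an extra burst on top of the quantum when a queue is backlogged. The queue type and the helper functions are assumptions, not the AMPS scheduler.

/* Illustrative DRR-style scheduling pass over external event queues.
 * Types and helper functions are assumptions, not the AMPS code. */
typedef struct {
    int len;        /* events currently queued */
    int credit;     /* accumulated credit c    */
} ext_queue_t;

extern void serve_one_event(ext_queue_t *q);   /* pops one event, calls its handlers */
extern void drain_internal_queues(void);       /* serves all internal event queues   */

void drr_round(ext_queue_t *queues, int nq, int quantum, int credit_limit)
{
    for (int i = 0; i < nq; i++) {
        ext_queue_t *q = &queues[i];
        int budget = quantum + q->credit;      /* credit acts as a burst allowance */
        int served = 0;

        while (q->len > 0 && served < budget) {
            serve_one_event(q);
            served++;
        }

        if (served < quantum) {
            /* Under-loaded queue: bank the unused part of the quantum. */
            q->credit += quantum - served;
            if (q->credit > credit_limit)
                q->credit = credit_limit;
        } else {
            /* Any burst beyond the quantum consumes the banked credit. */
            q->credit -= served - quantum;
        }

        /* After each external queue, flush internally generated events. */
        drain_internal_queues();
    }
}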

In figure 3, we have four network I/O descriptors that have as many network receiver functions. When network messages arrive on the descriptors, the receiver functions are called, which invoke an initial parser to determine the internal event type for each message. The receiver functions then generate internal events that are en-queued by the event manager into event queues for the particular message types. Each event queue holds messages for one particular event type. After all messages have been posted into event queues (queues numbered 1 to 4 in the example), the event manager invokes the scheduler to determine the event ordering. It then calls the registered event handlers in the order determined by the scheduler. Event handlers 2, 3 and 4 perform the necessary processing on their messages and then generate further messages to the network. Event handler 1, on the other hand, processes the event and then generates another event, passing the message along to that new event. This new event is en-queued in queue 5 by the event manager. The event is finally delivered and the registered event handler 5 is invoked. Handler 5 processes the message further and finally generates a network message to complete the processing.

Note that the programmer has to break the processing of event type 1 into two handlers, 1 and 5. Another important thing to note about figure 3 is that network messages are sent by event handlers directly, without the need to register yet another event handler in case the network I/O might block. This is because AMPS takes care of sending messages to sockets and other types of descriptors internally within the middleware. If the API finds that a descriptor would block, e.g. due to a busy error, it en-queues the message internally and returns success to the application. It sets up the operating system to notify it when the descriptor is writable again. The writable event occurring on the descriptor is also treated just like any other event in AMPS. The event handler registered for this event (registered internally by AMPS) sends the en-queued messages in FIFO order. If it again finds a blocking condition while the queue is being emptied, it repeats the process, i.e. re-registers the event handler and sets up the OS notification again. If the application sends another message to the same descriptor, the AMPS API for sending messages en-queues the new message in FIFO order if the queue is not empty; otherwise it tries to send it right away.
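This send path can be sketched roughly as follows using standard non-blocking socket calls; the queueing and write-event arming helpers are hypothetical placeholders for the AMPS internals just described.

/* Illustrative non-blocking send path with internal queuing on EAGAIN.
 * queue_* and arm_writable_event() are hypothetical AMPS internals. */
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

extern int  queue_is_empty(int fd);
extern void queue_append(int fd, const void *buf, size_t len);
extern void arm_writable_event(int fd);   /* ask the OS to report writability */

/* Called by the application through the AMPS send API. Always appears
 * to succeed; data that cannot be sent now is queued internally. */
int amps_send(int fd, const void *buf, size_t len)
{
    if (!queue_is_empty(fd)) {            /* preserve FIFO order */
        queue_append(fd, buf, len);
        return 0;
    }

    ssize_t n = send(fd, buf, len, 0);
    if (n == (ssize_t)len)
        return 0;                          /* fully sent */

    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
        return -1;                         /* real error */

    /* Partially sent or would block: queue the remainder and ask to be
     * notified when the descriptor becomes writable again. */
    size_t done = (n > 0) ? (size_t)n : 0;
    queue_append(fd, (const char *)buf + done, len - done);
    arm_writable_event(fd);
    return 0;
}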

Object caching
It is well established that caching of frequently accessed information is the key to performance enhancement in busy servers. A protocol server may cache information such as resolved DNS records and mappings, user registration and service profile data from the database, etc. The server application would look first in the cache before making a network or database hit for the information. AMPS provides a generic caching facility to cache arbitrary types of objects in memory. The user can create a cache object of any desired size. The API provides cache lookup based on opaque keys and an application-defined comparison function. The internal data structure used to implement the cache is a binary heap, also well known as a priority queue. We believe this mechanism can be put to good use to enhance performance and scalability by an order of magnitude under conditions of heavy load.

Timer Management
The timer module is one of the key components of a protocol server. Almost all non-trivial protocols involve timeouts in their design. One of the biggest problems with software timers in multi-threaded systems is that they go off asynchronously, making the protection of session-related data structures unavoidable. Even if per-session data is isolated from other threads by storing it in the stack variables of the threads, callback functions invoked on timer expiry must access that data to update session state. AMPS provides the basic timer management APIs for application developers. The key design goals of timer management in AMPS are as follows:
- The timer implementation must be efficient. There should be no searching or sorting involved in any of the timer operations. Starting, stopping, and expiry processing of timers must be O(1) operations.
- The timer implementation must be scalable. The timer module must be able to manage long timeouts (of the order of at least 24 hours or more) without requiring a large memory footprint.
- Timer callback functions must be called serially, so that there is no need to protect access to shared session-related data structures.
- Timers must be precise within a certain reasonable resolution (of the order of a few milliseconds). The shortest protocol-related timeouts are usually of the order of milliseconds.

AMPS is able to meet all the above goals in its timer implementation. The following section describes the design and implementation of the AMPS timer module.

AMPS timer module
AMPS treats the timer as a high-priority event in the system. The timer-tick event happens every k milliseconds, where k is the resolution of the timer module. By default, k is 1 millisecond. A separate thread runs at a higher priority than any other thread in the application, including the main application thread, I/O agents and CPU agents. This thread sleeps for k milliseconds, then wakes up and sends an event to the main application via IPC mechanisms. It then goes back to sleep for another k milliseconds. The main application's event manager processes the timer-tick event at a higher priority than any other event in the system. Also, after processing any event, it polls the timer event to see if the processing of the last event spanned more than k milliseconds. This way, it is guaranteed that no timer event will have to wait for more than the processing of a single non-timer event. This mechanism also makes it easy to determine which event (if any) is taking more than k milliseconds to process. If the application developer detects such a handler, it can be divided further into multiple events so that each sub-component takes at most k milliseconds. The timer creation API allows the user to specify a callback function and an opaque data handle or pointer to be passed as an argument to the callback.

The timer data structure
The timer is implemented as a data structure based on the timing-wheel idea presented in [8]. The data structure contains hierarchical arrays, each representing a counter of a digital clock. This concept has previously been used in implementing Linux kernel timers. We have used four such arrays in AMPS. The first array represents milliseconds. Each slot in this array represents one tick, or k milliseconds. The array contains 1000/k entries. This means that this array signifies the passing of a total of one second of time. This also implies that the milliseconds part of a timeout must be a multiple of k milliseconds. The next array represents seconds. It has 60 entries and signifies the passing of a total of one minute of time. Similarly, the next array represents minutes, has 60 entries and signifies a total of one hour of time. The last array represents hours, has 24 entries and signifies a total of one day of time. Each entry in each array is a doubly linked list of timer callback functions. At each timer-tick event, the event handler for the tick event traverses the list (if any) contained at the current index of the milliseconds array, calling each callback function serially, and finally increments the array index by one. When the index of the milliseconds array reaches the end of the array, the event handler replenishes the milliseconds array from the callback functions contained in the current index of the seconds array, and then increments the index of the seconds array by one. The milliseconds array then contains the timers due to expire within the next second.

When the seconds array index reaches its end, the event handler replenishes it from the lists of callbacks contained in the current index of the minutes array, and increments the index of the minutes array by one. The seconds array then contains all the timers due to expire within the next minute. Similarly, the minutes array is replenished from the hours array. This way, the milliseconds array index is incremented every tick, the seconds array index every second, the minutes array index every minute, and the hours array index every hour. Note that the callbacks that are actually called are always contained in the milliseconds array. To make the discussion concrete, let's consider the following example. The user wants to start a timer due to expire after 2 hours, 45 minutes, 30 seconds and 80 milliseconds. Let's suppose that k is 10 ms. Since the hours part is greater than zero, the timer callback first goes into the hours array. The API would add 2 to the current index of the hours array, and add the timer to the doubly linked list in that slot. When the index of the minutes array reaches its last entry, the handler replenishes the minutes array from the current index of the hours array. However, since our example timer is to go off after 2 hours, the turn of its slot will come after two hours, i.e. after the index of the minutes array has reached its end twice. At the end of two hours, the event handler processes each entry in the list at the current index of the hours array, and puts our example timer in the appropriate location of the minutes array. In our example, since the minutes part of the timeout is 45 minutes, it goes into the 45th slot of the minutes array. After 45 minutes, this timer moves to the 30th slot of the seconds array, and after a further 30 seconds, it is transferred to the 80/10 = 8th slot of the milliseconds array. After a further 80 milliseconds, or 8 ticks, its callback is called. The timer scheme is illustrated in the figure below.
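In code, the cascading scheme just described might be sketched as follows. This is an illustrative sketch with a 10 ms tick; the structures and list helpers are assumptions, not the AMPS implementation.

/* Illustrative hierarchical timing wheel (k = 10 ms tick assumed).
 * Not the AMPS implementation; list handling is simplified. */
#define TICK_MS    10
#define MS_SLOTS   (1000 / TICK_MS)   /* one second of ticks */
#define SEC_SLOTS  60                 /* one minute          */
#define MIN_SLOTS  60                 /* one hour            */
#define HOUR_SLOTS 24                 /* one day             */

typedef void (*timer_cb_t)(void *arg);

struct timer {
    timer_cb_t    cb;
    void         *arg;
    unsigned      ms_slot;            /* target slot in the next-lower wheel */
    struct timer *next, *prev;        /* doubly linked list within a slot    */
};

struct wheel {
    struct timer *ms[MS_SLOTS];
    struct timer *sec[SEC_SLOTS];
    struct timer *min[MIN_SLOTS];
    struct timer *hour[HOUR_SLOTS];
    unsigned ms_idx, sec_idx, min_idx, hour_idx;
};

extern void          slot_insert(struct timer **slot, struct timer *t);
extern struct timer *slot_drain(struct timer **slot);   /* detach whole list */

/* Called on every timer-tick event generated by the high-priority tick thread. */
void wheel_tick(struct wheel *w)
{
    /* Fire everything in the current milliseconds slot, serially. */
    struct timer *t = slot_drain(&w->ms[w->ms_idx]);
    while (t) {
        struct timer *next = t->next;
        t->cb(t->arg);                /* callbacks are always called serially */
        t = next;
    }

    if (++w->ms_idx < MS_SLOTS)
        return;
    w->ms_idx = 0;

    /* One second has passed: cascade the current seconds slot down into the
     * milliseconds wheel, then advance the seconds index (and so on upward). */
    for (t = slot_drain(&w->sec[w->sec_idx]); t; ) {
        struct timer *next = t->next;
        slot_insert(&w->ms[t->ms_slot], t);
        t = next;
    }
    if (++w->sec_idx < SEC_SLOTS)
        return;
    w->sec_idx = 0;
    /* Cascading from the minutes and hours wheels follows the same pattern. */
}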

Memory management
Efficient memory management is a key issue in any non-trivial software artifact. Memory handling can become a severe performance bottleneck if allocation and freeing operations for buffers involve expensive searching, sorting and re-combining of memory blocks. The AMPS memory management sub-system is designed to meet the following goals:
- The memory management must be fast and bear little overhead for the system.
- The memory management should try to minimize memory leakage without incurring the overhead of a garbage collector. Of course, this goal will not be relevant if AMPS is in future implemented in Java or any other language that provides automatic garbage collection.
- The memory manager should acquire memory from the underlying system, e.g. the C library, gradually on demand. However, once acquired, it may hold memory for a relatively long time to minimize memory-freeing overhead.

AMPS meets these goals in the following manner. Application protocol servers usually use memory in burst patterns. They build up relatively large data structures used for the duration of a particular phase of computation, and then discard most or all of those data structures [9]. The surviving data structures represent the results of a phase as opposed to intermediate values. A prime example is the processing of a protocol message that arrives from the network. The message usually enters the server, is parsed and processed, and possibly modified; new fields or data may be added while some fields may be removed or updated. Memory is dynamically allocated during all stages of a message's lifetime within the server. This memory may not be freed until the message leaves the system, e.g. when it is forwarded, replied to, or discarded after processing. The individual memory objects allocated during a message processing phase may be freed when the processing moves on to the next phase. However, if the freeing of these objects is delayed until the end of the whole message's lifetime, it may prove beneficial for two reasons:
a. If we keep accumulating memory allocated in smaller chunks (for individual objects) and free all objects at once, i.e. at only one place in the code, it substantially decreases the number of memory leaks. This is clear because we would be calling free at one place instead of at many different places.
b. We can pre-allocate a larger buffer in the beginning, i.e. when the message first enters the system, and then build a simple, extremely fast allocator on top of this buffer for our smaller objects. This is because we would not have to worry about managing free lists and other overheads for small, individual objects.

Taking the example of a protocol message: when it arrives at the server, we allocate a large buffer of an appropriate size using the low-level memory manager provided by the programming language, and set a pointer at the start of this buffer. This forms our context, or the memory manager object, for the particular message. When the processing of the message starts and passes through its various stages, the application allocates memory using a special API that takes the memory manager object just described as an argument. The allocator simply moves the pointer by the requested number of bytes and returns the address where the pointer was before the allocation request. The application can then write to that buffer using the returned pointer at will.

would thus move the pointer forward. Finally we may reach the end of the large buffer before the end of life of the message in our system. If that happens, we must allocate another large buffer, attach it to the previous one, and then start allocations from this new buffer. This way we may keep on building a linked list of larger buffers embedded in the memory manager object. This linked list would be freed when the message is no longer needed in the system i.e. when the manager object would be destroyed. The important thing to note is that the higher level allocator that allocates from the large buffer is extremely fast since it only adds the requested size to a pointer variable. AMPS provides an API to create a higher level memory manager object with a certain buffer size. This API returns an object that is passed to another API for allocations. Memory is allocated as described above i.e. from the linked list of buffers created inside the manager object. Another API is provided to free or destroy the memory manager object. For the lower level memory manager, AMPS provides another optimization. Once a larger buffer has been allocated for a manager object using the lower level allocation function provided by C i.e. malloc, AMPS caches that buffer in its internal sizesegregated linked lists of free buffers when the manager object is destroyed. When a memory manager object is created, AMPS first checks the cached linked list for that particular size, and if found, takes one from that free list instead of calling malloc. When the manager object is destroyed, AMPS returns the freed linked list of buffers for that object to the respective cache instead of calling the free function of C library. Note that this design implies that AMPS would keep building internal free lists of relatively large sized buffers and not return them to the applications heap ever. This may result in memory depletion under heavy loads e.g. when huge number of protocol sessions are concurrently in progress. To avoid this, AMPS provides a configuration API for its internal memory management. This API would free the size-segregated free lists when the total size of the buffers in these lists reaches a certain threshold. This threshold is specified in percentage of the total available physical memory. Of course, the application could additionally perform its own admission control by limiting the number of concurrent sessions based on some application specific criteria. Application could register an event handler for the internal AMPS event that fires when the segregated lists (memory usage) reaches a certain threshold of the available memory. Figure 5a below shows the memory management object. It contains a linked list of buffers of a particular size. It also contains other state information including the size of the buffer, total bytes allocated, pointer to head of list, current active buffer etc. Figure 5b shows the internal structure of a single buffer of the linked list. The pointer moves down towards the end of the buffer with each allocation. Figure 6 shows the buffer cache with linked lists of different sizes. These sizes are a power of 2 starting from 1K bytes.
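The per-message allocator described above is essentially a region, or pool, allocator. The sketch below illustrates the core idea in C; the names mem_mgr_create, mem_mgr_alloc and mem_mgr_destroy, and the structure layout, are invented for illustration and are not the actual AMPS API, and alignment of the returned pointers is ignored for brevity.

    #include <stdlib.h>

    /* One large buffer in the manager's linked list. */
    struct pool_buf {
        struct pool_buf *next;
        size_t           size;     /* usable bytes in data[]                 */
        size_t           used;     /* bump pointer: bytes already handed out */
        char             data[];   /* the buffer itself (C99 flexible array) */
    };

    /* The per-message memory manager (context) object. */
    struct mem_mgr {
        size_t           buf_size; /* size of each large buffer              */
        struct pool_buf *head;     /* list of all buffers ever acquired      */
        struct pool_buf *current;  /* buffer that allocations come from      */
    };

    static struct pool_buf *pool_buf_new(size_t size)
    {
        struct pool_buf *b = malloc(sizeof(*b) + size);
        if (!b)
            return NULL;
        b->next = NULL;
        b->size = size;
        b->used = 0;
        return b;
    }

    /* Create a manager whose large buffers are buf_size bytes each. */
    struct mem_mgr *mem_mgr_create(size_t buf_size)
    {
        struct mem_mgr *m = malloc(sizeof(*m));
        if (!m)
            return NULL;
        m->buf_size = buf_size;
        m->head = m->current = pool_buf_new(buf_size);
        if (!m->head) {
            free(m);
            return NULL;
        }
        return m;
    }

    /* Bump-pointer allocation: advance 'used' and return the old position.
       Requests larger than one buffer are rejected in this sketch.          */
    void *mem_mgr_alloc(struct mem_mgr *m, size_t n)
    {
        if (n > m->buf_size)
            return NULL;
        if (m->current->used + n > m->current->size) {
            struct pool_buf *b = pool_buf_new(m->buf_size);
            if (!b)
                return NULL;
            m->current->next = b;          /* grow the linked list of buffers */
            m->current = b;
        }
        void *p = m->current->data + m->current->used;
        m->current->used += n;
        return p;
    }

    /* Destroy the manager when the message leaves the system.  In AMPS the
       buffers would be returned to the size-segregated cache rather than
       handed straight back to the C library as done here.                   */
    void mem_mgr_destroy(struct mem_mgr *m)
    {
        struct pool_buf *b = m->head;
        while (b) {
            struct pool_buf *next = b->next;
            free(b);
            b = next;
        }
        free(m);
    }

A protocol message handler would create one such manager when a message arrives, allocate every parsed header, field and intermediate object from it, and destroy it at a single place once the message has been forwarded, replied to or discarded.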

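The buffer cache of Figure 6 can be pictured as an array of free lists indexed by power-of-two size class, together with a bound on the total bytes held. The following sketch is a simplified illustration with invented names; instead of the threshold-triggered purge described above, it simply stops caching once the configured limit is reached.

    #include <stdlib.h>

    #define MIN_CLASS_SHIFT 10          /* smallest class: 1 KB (2^10)       */
    #define NUM_CLASSES     8           /* 1 KB, 2 KB, ..., 128 KB           */

    struct cached_buf { struct cached_buf *next; };

    static struct cached_buf *cache[NUM_CLASSES];  /* size-segregated lists  */
    static size_t cached_bytes;                    /* total bytes held       */
    static size_t cache_limit = 64u * 1024 * 1024; /* placeholder; AMPS derives
                                                      this from a configured
                                                      percentage of RAM      */

    static int size_class(size_t size)
    {
        int c = 0;
        size_t s = (size_t)1 << MIN_CLASS_SHIFT;
        while (s < size && c < NUM_CLASSES - 1) {
            s <<= 1;
            c++;
        }
        return c;
    }

    static size_t class_size(int c)
    {
        return (size_t)1 << (MIN_CLASS_SHIFT + c);
    }

    /* Take a buffer from the cache if one is available, else call malloc.  */
    void *cache_get(size_t size)
    {
        int c = size_class(size);
        if (class_size(c) < size)       /* bigger than the largest class:   */
            return malloc(size);        /* bypass the cache in this sketch  */
        if (cache[c]) {
            struct cached_buf *b = cache[c];
            cache[c] = b->next;
            cached_bytes -= class_size(c);
            return b;
        }
        return malloc(class_size(c));
    }

    /* Return a buffer to its free list; give it back to the C library only
       when the cache has grown past the configured limit.                   */
    void cache_put(void *p, size_t size)
    {
        int c = size_class(size);
        if (class_size(c) < size ||
            cached_bytes + class_size(c) > cache_limit) {
            free(p);
            return;
        }
        struct cached_buf *b = p;
        b->next = cache[c];
        cache[c] = b;
        cached_bytes += class_size(c);
    }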
Scalability, clustering and distribution

One of the major concerns of server application developers is how the system will scale as the load increases. Ideally, the system should scale seamlessly when more hardware, i.e. computational power and memory, is added: if a server running on a single machine becomes congested, adding another machine should transparently double the system throughput.

Given the AMPS design described so far, a server can be scaled as follows. The CPU agents and I/O agents described earlier can easily be distributed across different machines, since both types of agents communicate with the main event-processor thread via messages; there is no dependency on, or requirement of, being in the same address space as the main application. If the agents can communicate with the main application via inter-thread mechanisms, they can communicate via inter-process mechanisms as well, and if the communicating processes run on different machines, they can communicate over TCP/IP. An application developer can use this idea to distribute several CPU and I/O agents across stand-alone machines. The main application running the event loop transparently generates events for the CPU and I/O agents as before. The registered event handler acts as a dispatcher, as before, and sends each event to one instance out of possibly several instances of an agent, running on different machines, over a TCP connection. The dispatcher may select an agent instance based on some load-balancing criterion, and therefore has to keep state about the health of, and the load on, the different agent instances. It must be noted that the protocol session state and other global state is still maintained in only one place, i.e. the main application. The machines running agents execute a simple application containing only an event loop and one or more CPU or I/O agents, each consisting of a dispatcher and a thread pool. The main loop of this simple application generates an internal event for its local dispatcher when an incoming event arrives, and the local dispatcher distributes the load to its local thread pool. The rest of the operation is the same as described before for CPU and I/O agents. The result is an application that is transparently distributed across machines, i.e. a fully distributed application. AMPS currently recommends using TCP for inter-machine connections, which provides the benefits of reliability, congestion control and automatic keep-alives.
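As a rough illustration of this dispatching idea, again with invented names rather than the actual AMPS API, the event handler acting as a dispatcher might keep a table of remote agent instances, pick a healthy and lightly loaded one, and write the serialized event to that instance's TCP socket:

    #include <stddef.h>
    #include <sys/socket.h>

    /* One remote instance of a CPU or I/O agent, reachable over TCP. */
    struct agent_instance {
        int fd;               /* connected TCP socket to the remote machine */
        int outstanding;      /* events sent but not yet acknowledged       */
        int healthy;          /* updated from keep-alives and replies       */
    };

    /* Pick the healthy instance with the fewest outstanding events. */
    static struct agent_instance *
    pick_instance(struct agent_instance *inst, size_t n)
    {
        struct agent_instance *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (!inst[i].healthy)
                continue;
            if (!best || inst[i].outstanding < best->outstanding)
                best = &inst[i];
        }
        return best;
    }

    /* The event handler acting as a dispatcher: forward a serialized event.
       Partial writes and reconnection are ignored for brevity.              */
    int dispatch_event(struct agent_instance *inst, size_t n,
                       const void *event, size_t len)
    {
        struct agent_instance *target = pick_instance(inst, n);
        if (!target)
            return -1;                   /* no healthy instance available    */
        if (send(target->fd, event, len, 0) < 0) {
            target->healthy = 0;         /* mark suspect; caller may retry   */
            return -1;
        }
        target->outstanding++;
        return 0;
    }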

References

1) The Problem with Threads, Edward A. Lee, UC Berkeley Technical Report, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html, January 2006.
2) accept()able Strategies for Improving Web Server Performance, Tim Brecht, David Pariag, Louay Gammo, USENIX 2004.
3) I/O Multiplexing & Scalable Socket Servers, Ian Barile, Dr. Dobb's Journal, February 2004.
4) Event-Driven Programming for Robust Software, Frank Dabek, Nickolai Zeldovich, Frans Kaashoek, David Mazieres, Robert Morris, Proceedings of the 10th ACM SIGOPS European Workshop, 2002.
5) SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Matt Welsh, David Culler, Eric Brewer, Proceedings of the Eighteenth Symposium on Operating Systems Principles (SOSP-18), Banff, Canada, October 2001.
6) The Ninja Architecture for Robust Internet-Scale Systems and Services, Steven Gribble, Matt Welsh, Rob von Behren, Eric A. Brewer, David Culler, N. Borisov, S. Czerwinski, Gummadi, J. Hill, A. Joseph, R. H. Katz, Z. M. Mao, S. Ross, B. Zhao, http://ninja.cs.berkeley.edu
7) Flash: An Efficient and Portable Web Server, Vivek S. Pai, Peter Druschel, Willy Zwaenepoel, Proceedings of the USENIX 1999 Annual Technical Conference.
8) Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility, George Varghese, Tony Lauck, IEEE/ACM Transactions on Networking, 1996.
9) Dynamic Storage Allocation: A Survey and Critical Review, Paul Wilson, Mark Johnstone, Michael Neely, David Boles, Proceedings of the 1995 International Workshop on Memory Management, Kinross, Scotland, September 1995.
