Platform as a Service Under the Hood Episodes 1-5
PaaS Under the Hood INTRODUCTION
Building a Platform as a Service (PaaS) is rewarding work. We get to make the life of a developer easier. A PaaS helps developers deploy, scale, and manage their applications, without requiring them to become hardcore systems administrators themselves.

As with many problems, the toughest part of managing applications in the cloud is not building the PaaS itself. The challenge lies in being able to scale the applications. To give you a sense of the complexity: each minute, millions of HTTP requests are routed through the platform. Not only does our PaaS collect millions of metrics, we also aggregate, process, and analyze those metrics and look for abnormal patterns. Apps are constantly deployed and migrated on our PaaS platform. For economies of scale, virtually all PaaS providers pack applications densely onto their physical machines. How does a PaaS provider solve the following issues?
• How is application isolation accomplished?
• How does the platform handle data isolation?
• How does the platform deal with resource contention?
• How does the platform deploy and run apps efficiently?
• How does the platform provide security and resiliency?
• How does the platform handle the load from the millions of HTTP requests?
One key element is lightweight virtualization, which is the use of virtual environments (called containers) to provide isolation characteristics comparable to full-blown virtual machines, but with much less overhead. In this area, the dotCloud platform relies on Linux Containers (LXC). In the following five episodes, we will dive into some of the internals of the dotCloud platform or, more specifically, the Linux kernel features used by dotCloud.
Overview: PaaS Under the Hood
Episode 1: Namespaces provide isolation of customer applications from one another.
Episode 2: Control Groups (cgroups) ensure fair sharing of computing resources.
Episode 3: AUFS (AnotherUnionFS) provides fast provisioning while retaining full flexibility and ensuring disk and memory savings.
Episode 4: GRSEC is a security patch to the Linux kernel and helps detect and deter malicious code.
Episode 5: Open source Hipache is the distributing proxy that powers dotCloud’s routing layer.
Episode 1: Kernel Namespaces

Simplifying complexity takes a lot of work. At dotCloud, we take highly complex processes such as deploying and scaling web applications in the cloud and make them appear as simple workflows to developers and DevOps. How do we accomplish such a feat? In this eBook, we will show you how dotCloud works under the hood. A developer once said, “Diving into the inner workings of a PaaS is like going to Disneyland, you’ll uncover a world of wonder.” We will expose the mechanics behind the kernel-level virtualization and high-throughput network routing. We will expose other technologies such as metrics collection and memory optimization in later eBooks.

dotcloud.com
Each time a new Linux Container (LXC) is created, the name of the container is filed under the /cgroup directory. For example, a new container named “sanfrancisco” is filed under the directory “/cgroup/sanfrancisco.” It is easy to think that the container relies on the control groups. Although cgroups are useful to Linux Containers (we will cover cgroups more thoroughly in Episode 2), namespaces provide an even more vital function. Namespaces isolate the resources of processes. This isolation is the real magic behind Linux Containers! There are five namespaces, each covering a different resource: pid, net, ipc, mnt, and uts.

The pid namespace

The pid namespace is the most useful technology for basic isolation. Each pid namespace has its own process numbering. Different pid namespaces form a hierarchy, and the kernel keeps track of all the namespaces. A “parent” namespace can see and act on its “child” namespaces. A “child” namespace cannot perform any actions on its “parent”. The pid namespace follows these principles:
• Each pid namespace has its own “PID 1” init-like process
• Processes residing in a namespace cannot affect processes residing in a parent or sibling namespace with system calls like kill or ptrace, because process IDs are only meaningful inside a given namespace
• If a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes belonging to the namespace
• Numbering is different in each namespace, which means that a process in a child namespace can have multiple PIDs: for example, one in its own namespace and a different PID in its parent namespace

The top-level pid namespace can see all processes running in all namespaces, with different PIDs. A process can have more than two PIDs if there are more than two levels of hierarchy in the namespaces.
The net namespace

With the pid namespace, you can start processes in multiple isolated environments called “containers”. What if you need to run separate instances of the Apache web server in each container? Generally only one process can listen on port 80/TCP at a time. You could configure each instance of Apache to listen on a different port, or you could use the net namespace, which has been designed for networking.

Each net namespace can have different network interfaces. Even lo, the loopback interface supporting 127.0.0.1, can be different in each net namespace. It is even possible to create a pair of special interfaces, which will appear in two different net namespaces and allow one of the two net namespaces to talk to the outside world. A typical container will have its own loopback interface (lo), as well as one end of such a special interface, generally named eth0. The other end of the special interface will be in the “original” namespace, and will bear a poetic name like veth42xyz0. It is then possible to put those special interfaces together within an Ethernet bridge (to achieve switching between containers), or route packets between them. This is similar to the Xen networking model.

Each net namespace has its own local meaning for INADDR_ANY, a.k.a. 0.0.0.0. When your Apache process binds to INADDR_ANY and port 80 (*:80) within its namespace, it will only receive connections directed to the IP addresses and interfaces of its namespace. That allows you to run multiple Apache instances, each in its own pid and net namespace, with their default configuration listening on port 80, and each will remain individually addressable. Each net namespace also has its own routing table, its own iptables chains and rules, etc.
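The port conflict that motivates the net namespace can be reproduced within a single namespace. The following is a minimal sketch, not part of the dotCloud code base: the function name is hypothetical, and it uses an ephemeral port chosen by the kernel instead of port 80 so that it does not need root privileges.

```python
import socket


def second_bind_fails() -> bool:
    """Return True if a second socket cannot bind a port that is already taken."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # "" means INADDR_ANY (0.0.0.0); port 0 asks the kernel for a free port.
        first.bind(("", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, same net namespace: this is expected to fail.
            second.bind(("", port))
        except OSError:
            return True
        return False
    finally:
        first.close()
        second.close()
```

Inside separate net namespaces, each process would own its private *:80 and the second bind would succeed, which is the behavior described above.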
The ipc namespace

The ipc namespace won’t appeal to many of you, unless you passed UNIX 101 when engineering schools still taught classes on IPC (InterProcess Communication). IPC provides semaphores, message queues, and shared memory segments. While still supported by virtually every UNIX flavor, those features are considered by many as obsolete, and superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs, such as PostgreSQL, still use IPC.

What’s the connection with namespaces? Each IPC resource is accessed through a globally unique 32-bit ID. While IPC implements permissions on the resource itself, an application could be surprised if it failed to access a given resource because it has already been claimed by another process in a different container. The app doesn’t know anything about other containers! Meet the ipc namespace. Processes within a given ipc namespace cannot access (or even see) the IPC resources living in other ipc namespaces. And now you can safely run a PostgreSQL instance in each container without the fear of IPC key collisions.

The mnt namespace

chroot is a mechanism to sandbox a process (and its children) within a given directory. The mnt namespace takes the chroot concept even further. As its name implies, the mnt namespace deals with mount points. Processes living in different mnt namespaces can see different sets of mounted file systems and different root directories. If a file system is mounted in an mnt namespace, it will be accessible only to the processes within that namespace. It will not be visible to processes in other namespaces.

At first impression, it may sound useful, since the mnt namespace allows you to sandbox each container within its own directory, hidden from other containers. However, is this really useful after all? If each container is chroot’ed in a different directory, container C1 won’t be able to access or see container C2’s file system, right? There are downsides to relying on chroot alone. Inspecting /proc/mounts in a container will show the mount points of all containers. Also, those mount points will be relative to the original namespace, which can give out some hints about the layout of your system. Seeing paths from the global namespace may also confuse applications that rely on the paths in /proc/mounts being local.

The mnt namespace makes the situation much cleaner, allowing each container to have its own mount points, and see only those mount points, with their paths correctly translated to the actual root of the namespace.
The uts namespace

Finally, the uts namespace deals with one important detail: the hostname that can be “seen” by a group of processes. Each uts namespace can have a different hostname, and changing the hostname through the sethostname system call will only change it for processes running in the same namespace.

Creating namespaces

Namespace creation is achieved with the clone system call. This system call supports a number of flags, allowing you to specify whether the new process should run within its own pid, net, ipc, mnt, and uts namespaces. The following steps take place when creating a new container. A new process starts, with new namespaces created. Its network interfaces, including the special pair of interfaces to talk with the outside world, are configured. It then executes an init-like process.

When the last process within a namespace exits, the associated resources (IPC, network interfaces, ...) are automatically reclaimed. If, for some reason, you want those resources to survive after the termination of the last process of the namespace, you can use mount --bind to retain the namespace for future use, because each namespace is represented by a special file in /proc/$PID/ns. Not all namespaces can be retained this way: up to kernel 3.4, there is support for ipc, net, and uts namespaces, but not for mnt and pid namespaces. This presents a problem that we will address in the next section.
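The flags passed to clone can be sketched as a simple bit mask. This is only an illustration: the numeric values are the CLONE_NEW* constants from the Linux header linux/sched.h, while the dictionary and the function name are hypothetical helpers introduced here. The sketch computes the mask and does not call clone itself, which would require privileges.

```python
# Values of the CLONE_NEW* constants defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,  # CLONE_NEWUTS
    "ipc": 0x08000000,  # CLONE_NEWIPC
    "pid": 0x20000000,  # CLONE_NEWPID
    "net": 0x40000000,  # CLONE_NEWNET
}


def clone_flags(namespaces):
    """Combine the flags for the requested namespaces into one bit mask."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


# A container that gets all five namespaces would request every flag at once.
ALL_NAMESPACES = clone_flags(CLONE_FLAGS)
```

Requesting only some of the flags gives a process that shares the remaining namespaces with its parent.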
Attaching to Existing Namespaces

It is also possible to get into or “enter” a namespace, by attaching a process to an existing namespace. Here are some use cases for attaching to existing namespaces:
• Setting up network interfaces “from the outside”, without relying on scripts inside the container
• Running arbitrary commands to retrieve information about the container (for example, by executing netstat)
• Obtaining a shell within a container

Attaching a process to existing namespaces requires two things:
• The setns system call (which exists only since kernel 3.0, or with patches for older kernels)
• The namespace must appear in /proc/$PID/ns

We mentioned in the previous section that only ipc, net, and uts namespaces were supported in /proc/$PID/ns and that mnt and pid namespaces were not. Only a patched kernel will allow you to attach to existing mnt and pid namespaces. Combining the necessary patches can be fairly tricky, because it involves resolving conflicts between AUFS and GRSEC. AUFS and GRSEC will be covered in Episodes 3 and 4 respectively. To avoid running an overly patched kernel, there are three suggested workarounds.
• You can run sshd in your containers, and pre-authorize a special SSH key to execute your commands. This is one of the easiest solutions to implement. But if sshd crashes, or is stopped (either intentionally or by accident), you may be locked out of the container. Also, if you want to squeeze the memory footprint of your containers as much as possible, you might want to get rid of sshd. If the latter is your main concern, you can run a low-profile SSH server like dropbear. Or, you can start the SSH service from inetd or a similar service.
• If you want something simpler than SSH (or something different from SSH to avoid interference with custom sshd configurations), you can open a backdoor. An example would be to run socat TCP-LISTEN:222,fork,reuseaddr EXEC:/bin/bash,stderr from init in your containers. Make sure that port 222/tcp is properly firewalled.
• An even better solution is to embed this “control channel” within your init process. Before changing its root directory, the init process could set up a UNIX socket on a path located outside the container root directory. When it changes its root directory, it will retain its open file descriptors and, therefore, the control socket.
How dotCloud uses namespaces

In previous releases, the dotCloud platform used vanilla LXC (Linux Containers), which made implicit use of namespaces. As the dotCloud platform evolved, we stripped down the vanilla LXC containers, but we still make use of namespaces to isolate applications from each other. From the beginning, we deployed kernel patches that allowed us to attach arbitrary processes to existing namespaces. We found this approach to be the most convenient and reliable way to deploy, control, and orchestrate containers.
Episode 2: cgroups

Control groups, or “cgroups”, are a set of mechanisms to measure and limit resource usage for groups of processes. Conceptually, they work somewhat like the ulimit shell command or the setrlimit system call. ulimit and setrlimit set resource limits for a single process. cgroups allow you to set resource limits for groups of processes.
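For comparison, the per-process limits mentioned above can be read from Python through the standard resource module, which wraps getrlimit and setrlimit. This is a minimal sketch; the function name is a hypothetical helper, and RLIMIT_NOFILE (the maximum number of open files) is just one example of a per-process limit.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open files for the current process."""
    # This limit applies to the calling process, not to a group of processes.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard
```

A cgroup limit, by contrast, is attached to the group, and every process placed in the group is accounted against it.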
Pseudo-FS Interface

The easiest way to manipulate control groups is through the cgroup file system. Assuming that it has been mounted on /cgroup, creating a new group named polkadot is as easy as mkdir /cgroup/polkadot. When you create this (pseudo) directory, it instantly gets populated with many (pseudo) files to manipulate the control group. You can then move one (or many) processes into the control group by writing their PID to the right control file, for example, echo 4242 > /cgroup/polkadot/tasks. When a process is created, it will be in the same group as its parent. If the init process of a container has been placed in a control group, all the processes of the container will also be in the same control group. Destroying a control group is as easy as rmdir /cgroup/polkadot. However, the processes within the cgroup have to be moved to other groups first. Otherwise rmdir will fail, since it is like trying to remove a non-empty directory.

What can be Controlled?

Many things! We’ll highlight here the ones we think are the most useful. Technically, control groups are split into many subsystems. Each subsystem is responsible for a set of files in /cgroup/polkadot, and the file names are prefixed with the subsystem name. For instance, the files cpuacct.stat, cpuacct.usage, and cpuacct.usage_percpu are the interface for the cpuacct subsystem. The available subsystems are detailed in the following sections.

The subsystems can be used together, or independently. In other words, you can decide that each control group will have limits and counters for all the subsystems. Alternatively, each subsystem can have different control groups. To explain the latter case more fully: the same process can be in the polkadot control group for memory control and in the bluesuedeshoe control group for CPU control, with polkadot and bluesuedeshoe living in completely separate hierarchies.
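The mkdir, echo, and rmdir steps above translate directly into file system calls. The sketch below is an illustration under assumptions: the function names are hypothetical, and the cgroup root is passed as a parameter so that the code can be tried against an ordinary scratch directory. On a real cgroup mount the tasks file is created by the kernel, and the calls require appropriate privileges.

```python
import os


def create_group(root, name):
    """Equivalent of `mkdir /cgroup/<name>`; returns the group directory."""
    path = os.path.join(root, name)
    os.mkdir(path)
    return path


def add_task(group_path, pid):
    """Equivalent of `echo <pid> > /cgroup/<name>/tasks`."""
    with open(os.path.join(group_path, "tasks"), "w") as control_file:
        control_file.write(str(pid))


def destroy_group(group_path):
    """Equivalent of `rmdir /cgroup/<name>`; raises OSError if it fails."""
    os.rmdir(group_path)
```

With the cgroup file system mounted on /cgroup, create_group("/cgroup", "polkadot") followed by add_task(path, 4242) would mirror the shell commands quoted in the text.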
Memory

You can limit the amount of RAM and swap space that can be used by a group of processes. The memory controller accounts for the memory used by the processes for their private use (their Resident Set Size, or RSS), but also for the memory used for caching purposes. This is actually quite powerful, because traditional tools such as ps or analysis of /proc do not have a way to identify the cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases. A database typically consumes very little memory for processing but consumes a large chunk of cache. (Complex queries would consume a lot of memory but, for this example, we are not performing complex queries.) To perform optimally, your whole database (or at least your “active set”, the data that you refer to the most often) should fit into memory.

Setting a memory limit for the processes inside a cgroup is as easy as echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes (the value will be rounded to a multiple of the page size). To check the current usage for a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You can gather very detailed (and very useful) information using memory.stat.

CPU

You might already be familiar with scheduler priorities, and with the nice and renice commands. Once again, control groups will let you define the amount of CPU that should be shared by a group of processes, instead of by a single process. You can give each cgroup a relative number of CPU shares, and the kernel will make sure that each group of processes gets access to the CPU in proportion to the number of shares you gave it. Setting the number of shares is as simple as echo 250 > /cgroup/polkadot/cpu.shares. Remember that those shares are just relative numbers. If you multiply everyone’s share by 10, the end result will be exactly the same. This control group also gives statistics in cpu.stat.
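The relative nature of CPU shares can be checked with a few lines of arithmetic. This is a small sketch with a hypothetical helper name and made-up share values; it assumes that all groups are competing for the CPU at the same time, which is when the proportions apply.

```python
from fractions import Fraction


def cpu_fractions(shares):
    """Map each group to the fraction of CPU time its shares entitle it to."""
    total = sum(shares.values())
    return {group: Fraction(value, total) for group, value in shares.items()}


# Hypothetical example: two groups with 250 and 750 shares.
original = cpu_fractions({"polkadot": 250, "bluesuedeshoe": 750})
# Multiplying every share by 10 leaves the proportions unchanged.
scaled = cpu_fractions({"polkadot": 2500, "bluesuedeshoe": 7500})
```

Both dictionaries contain the same fractions, one quarter and three quarters, which is why only the ratio between shares matters.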
CPU Sets

This is different from the cpu controller. In systems with multiple CPUs (i.e., the vast majority of servers, desktop and laptop computers, and even phones today!), the cpuset control group lets you define which processes can use which CPU. This can be useful to reserve a full CPU for a given process or group of processes. Those processes will receive a fixed amount of CPU cycles, and they might also run faster because there will be less thrashing at the level of the CPU cache.

On systems with Non Uniform Memory Access (NUMA), the memory is split into multiple memory banks, and each bank is tied to a specific CPU or set of CPUs. There is a penalty to pay when a process is scheduled to run on one CPU while accessing RAM that is tied to another CPU, so you can use a cpuset to bind a process and its memory to a specific CPU and avoid the penalty. This works for a group of processes or a group of CPUs too.

Block I/O

The blkio controller provides a lot of information about the disk accesses (or technically, block device requests) performed by a group of processes. This is a useful technology, because I/O resources are much harder to share than CPU or RAM. A system has a given, known, and fixed amount of RAM. It has a fixed number of CPU cycles every second. Even on systems where the number of CPU cycles can change (such as tickless systems, or virtual machines), this does not present an issue, because the kernel will slice the CPU time in shares of, e.g., 1 millisecond, and there is obviously a given, known, and fixed number of milliseconds in every second.

However, I/O bandwidth can be unpredictable, or the predictions aren’t very useful. A hard disk with a 10 ms average seek time will be able to process about 100 requests of 4 kB per second, but if the requests are sequential, typical desktop hard drives can easily sustain 80 MB/s transfer rates, which means 20000 requests of 4 kB per second. The average throughput (measured in IOPS, I/O Operations Per Second) will be somewhere between those two extremes. But as soon as the application performs a task that requires a lot of scattered, random I/O operations, the performance will drop dramatically. The system can give you some guaranteed performance, but this guaranteed performance is so low that it is not helpful. That is exactly the problem over at AWS EBS, by the way. It’s like a highway that guarantees you will be able to go above a given speed, except that this speed is 5 mph. Not very helpful in practice, is it?
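The two extremes quoted for a hard disk follow from simple arithmetic. The sketch below reproduces them; the function names are hypothetical, and it assumes decimal units (1 MB = 1,000,000 bytes, 1 kB = 1,000 bytes), which is what makes 80 MB/s come out at exactly 20000 requests per second.

```python
def random_iops(seek_time_ms):
    """Requests per second when every request pays one average seek."""
    return 1000 // seek_time_ms


def sequential_iops(throughput_bytes_per_s, request_bytes):
    """Requests per second when the disk streams at its full transfer rate."""
    return throughput_bytes_per_s // request_bytes


worst_case = random_iops(10)                       # 10 ms average seek time
best_case = sequential_iops(80_000_000, 4_000)     # 80 MB/s, 4 kB requests
```

The ratio between the two figures is 200, which is why a guarantee based on the worst case is of little practical use.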
That’s why SSD storage is becoming increasingly popular. An SSD has virtually no seek time, and can therefore sustain random I/O as fast as sequential I/O. The available throughput is therefore predictably good, under any given load. An example of this use case would be to use SSDs to serve video on demand for hundreds of HD channels simultaneously.

Actually, there are some workloads that can cause problems. For instance, writing and rewriting a whole disk will cause performance to drop dramatically. This is because read and write operations are fast, but erase, which must be performed at some point before write, is slow. The disk will sustain the write throughput until it has written every block once. When it needs to erase, the performance will drop below acceptable levels.

Going back to dotCloud, what’s the purpose of the blkio controller in a PaaS environment? The blkio controller metrics help detect applications that are putting an excessive strain on the I/O subsystem. The controller lets you set limits, which can be expressed in number of operations and/or bytes per second. It also allows for different limits for read and write operations. It allows you to set thresholds so that no single app can significantly degrade performance for other apps. Furthermore, once an I/O intensive app has been identified, its quota can be adapted to reduce its impact on other apps.
It’s Not Only for Containers

As we mentioned, cgroups are convenient for containers, since it is very easy to map each container to a cgroup. But there are many other uses for cgroups. The systemd service manager is able to put each service in a different cgroup. This allows you to keep track of all the subprocesses started by a given service, even when they use the “double-fork” technique to detach from their parent and re-attach to init. It also allows fine-grained tracking and control of the resources used by each service. It is also possible to run a system-wide daemon to automatically classify processes into cgroups. This can be particularly useful on multi-user systems, to limit and/or meter appropriately the resources of each user, or to run some specific programs in a special cgroup when you know that those programs are prone to excessive resource use.

dotCloud & Control Groups

Thanks to cgroups, we can meter very accurately the resource usage of each container, and therefore of each unit of each service for each application. Our metrics collection system uses collectd, along with our in-house lxc plugin. Metrics are streamed to a custom storage cluster, and can be queried and streamed by the rest of the platform using our ZeroRPC protocol. We will be writing a more in-depth article on the metrics collection system in the future. We also use cgroups to allocate resource quotas for each container. For instance, when you use vertical scaling on dotCloud, you are actually setting limits for memory, swap usage, and CPU shares.
Episode 3: AUFS

AUFS (which initially stood for Another Union File System) provides fast provisioning while retaining full flexibility and ensuring disk and memory savings.
AUFS is a union file system, which merges two directory hierarchies together. LiveCDs or bootable USBs are common examples of this use case. On the dotCloud platform, we use AUFS to combine a large, read-only file system containing a ready-to-run system image with a writable layer on top. The resulting file system looks like the large read-only one, except that you can now write on it anywhere, and only the changed files are stored. AUFS allows us to have a common base image for all applications and a separate read-write layer, unique to each app.

Storage Savings

Let’s assume that the base image takes up 1 GB of disk space. In reality, it is actually more than that, since we’re talking about a full server file system, containing everything a dotCloud app could potentially need such as Python, Ruby, Perl, Java, a C compiler and libraries, and so on. If the entire image had to be cloned each time a dotCloud application is deployed, it would use 1 GB of disk space for each new deployment. AUFS therefore lets us save on storage costs, because the read-write layer of an app typically uses less than 1 MB of disk space.

Faster Deployments

Copying the whole base image would not only use up precious disk space, it would also take time, up to a minute or so depending on the disk speed. Also, the copy would put a significant I/O load on the disk. On the other hand, creating a new “pseudo-image” using AUFS takes a fraction of a second, and virtually no I/O at all. AUFS offers a much better solution than copying an entire image every time.
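The storage argument can be made concrete with the figures used in the text. This is a back-of-the-envelope sketch with hypothetical names: it takes the 1 GB base image as 1000 MB and uses 1 MB per read-write layer, which the text gives as an upper bound for a typical app.

```python
BASE_IMAGE_MB = 1000  # assumed size of the base image (1 GB)
LAYER_MB = 1          # assumed size of one per-app read-write layer


def full_copy_mb(apps):
    """Disk space used when every app gets its own copy of the base image."""
    return apps * BASE_IMAGE_MB


def union_mb(apps):
    """Disk space used with one shared base image plus one layer per app."""
    return BASE_IMAGE_MB + apps * LAYER_MB


copies = full_copy_mb(1000)  # 1000 apps, each with a full clone
shared = union_mb(1000)      # 1000 apps sharing a single base image
```

For 1000 apps this compares 1,000,000 MB of full clones with 2000 MB for the shared image and its layers.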
Better Memory Usage

Virtually all operating systems use a feature called buffer cache to make disk access faster. Without it, your system could run 10x, 100x, or up to 1000x slower, because it would have to access the disk even to run simple commands, for example, when listing your files with ls! As we will see, AUFS also lets us rake in big savings on this buffer cache.

Every single application will load from disk a number of common files and components such as the libc standard library, the /bin/sh standard shell, and a lot of common infrastructure, like crond, sshd, and the local Mail Transfer Agent, just to name a few. Additionally, all applications of the same type will load the same files. For example, Python applications will load a copy of the Python interpreter every time. If each app were running from its own copy, identical copies of those common files would be present multiple times in memory, within the buffer cache. Using AUFS, those common files are in the base image, and the Linux kernel therefore knows to load them only once in memory. This will typically save tens of MB for each app.

Easier Upgrades

If you are familiar with storage technology, you might argue that snapshots and copy-on-write devices already have the advantages mentioned above. That’s true. However, with those systems, it is not possible to update the base image and have the changes reflected in the lightweight “clones” such as the snapshots. AUFS, on the other hand, lets you do whatever you want with the base image. The changes will be immediately visible in the AUFS mount points using the base image, even while the applications are running. It means that it is easy to do software upgrades, just like in a typical single-server environment, except that you can upgrade thousands of servers all at once.
Allows Arbitrary Changes

All those things can also be done without AUFS, with some clever configuration and tuning. For a decade, skilled UNIX systems administrators have been deploying machines (workstations, X terminals, servers, ...) with a read-only root file system, allowing read-write access through ad hoc mount points. After all, you don’t need to write anywhere else except places like /tmp, /var/run, /var/lock, and of course /home. The latter can be a traditional read-write file system, and the former can even use a tmpfs mount.

Let’s suppose you need an extra package, or maybe you want to upgrade the version of Python or Ruby. On a system without AUFS, one with only a shared read-only root file system and distinct writable mount points, you have two alternatives.
• Either you upgrade the read-only base image (and potentially affect all other users of the image)
• Alternatively, install whatever you need into a specific writable mount point like /home, /tmp or equivalent. That means a manual install, potentially introducing side effects or conflicts with previously installed versions

With AUFS, just apt-get install whatever you need, since the entire “read-only” root file system is still writable. The read-only base file system won’t be affected, because all the changes will be written onto your own private layer.

Use Cases for AUFS

Because it allows for arbitrary changes to the file system, AUFS offers many advantages.

Other Union File Systems

In addition to AUFS, we considered many file systems with properties similar to those outlined above. We opted for AUFS because, for what we need to do at dotCloud, we believe that it was the most mature and stable solution at the time of our evaluation.
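The layering behavior described above (reads fall through to the shared base image, writes land in a private layer) can be imitated with Python's collections.ChainMap. This is only a toy model of the idea, not how AUFS is implemented, and the file names and contents are hypothetical.

```python
from collections import ChainMap

# Shared read-only base image, modeled as a mapping from path to content.
base_image = {"/usr/bin/python": "python-old", "/bin/sh": "shell"}

# One private read-write layer per app, initially empty.
layer_a, layer_b = {}, {}

# Lookups search the private layer first, then the base image.
# Writes through a ChainMap only ever touch its first mapping.
app_a = ChainMap(layer_a, base_image)
app_b = ChainMap(layer_b, base_image)

# App A upgrades Python: the change is stored in its own layer only.
app_a["/usr/bin/python"] = "python-new"

# Upgrading the base image is immediately visible to every app
# that has not overridden the file in its private layer.
base_image["/bin/sh"] = "shell-upgraded"
```

After these steps, app A sees its own Python, app B still sees the one from the base image, and both see the upgraded shell.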
Caveats

However, technology is constantly evolving, and no solution is ever a perfect match for our changing requirements. When we were using AUFS 2, it had significant issues, notably with mmap. However, the other union file systems performed even worse for that specific issue. We worked around those issues by mounting some read-write volumes at strategic places, such as the data directories of MySQL, PostgreSQL, MongoDB, and Redis, and on the home directory where the application code is executed. Mounting the read-write volumes into the data directories gave us the required stability. We were able to leverage the flexibility provided by AUFS without the downside. We are currently using AUFS 3.

AUFS at dotCloud

Technically, the main feature that benefits from AUFS is our custom package installation system. If you need a particular library that is not included in our base image, but the library does exist in the Ubuntu package repository, then installing it in your service can be a breeze! Use the systempackages option in your dotcloud.yml file. AUFS allows the package to be installed into your service, without ever touching the base image used by other applications.
Episode 4: GRSEC

GRSEC is a security patch to the Linux kernel. Security features in GRSEC help detect and deter malicious code.
GRSEC is a fairly large patch for the Linux kernel, providing strong security features that prevent many kinds of attacks (or “exploits”) and detect suspicious activity such as people looking for new exploits and/or known system vulnerabilities. There are many features in GRSEC, so our goal is to provide an overview of the features relevant to dotCloud.

Prevent Execution of Arbitrary Code
There are two steps to make sure that arbitrary code can’t make it inside a running program.

First, the heap and the stack must be marked as non-executable. After all, they’re supposed to contain data structures, function parameters, and return addresses, but no opcode should be in there. On architectures supporting it, the heap and the stack regions should be marked as non-executable at the hardware level, effectively preventing accidental or intentional execution of code located in there.

Second, program code must be loaded in an area that is marked by the memory management unit as being read-only. This prevents code from modifying itself. Self-modifying code is sometimes referred to as polymorphic code. There are legitimate use cases for polymorphic code. However, it is more often associated with dubious intentions.

At this point, there is no memory that is both executable and writable.

Randomize Address Space
Many exploits rely on the fact that the base address for the heap or the stack is always the same. Consider the following example; this is a classic scenario for an attack on a remote service:
• A bug is found in the service. Some index is not checked properly, and can be used to alter the stack, and cause a jump to an arbitrary address (when a function returns)
• The stack is altered to introduce some malicious code
• A pointer to this malicious code is placed on the stack as well
• The bug is triggered. The service jumps to the malicious code and executes it

If the address space of the stack is randomized, it would be much more difficult for an attacker to exploit the system. The attacker would have to locate his malicious code before he can jump to the code in memory.
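The rule that no memory should be both writable and executable can be checked from user space on Linux. The following is a minimal sketch, not part of GRSEC itself: it parses the standard /proc/<pid>/maps listing and reports any region whose permissions include both w and x. The names Region, parse_maps, writable_and_executable and the SAMPLE text are illustrative assumptions.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    perms: str   # e.g. "r-xp" or "rw-p"
    path: str    # mapped file, "[stack]", "[heap]", or "" for anonymous memory


def parse_maps(text: str) -> List[Region]:
    """Parse the content of /proc/<pid>/maps into a list of regions."""
    regions = []
    for line in text.splitlines():
        # Columns: address range, perms, offset, device, inode, optional path
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append(Region(start, end, fields[1], path))
    return regions


def writable_and_executable(regions: List[Region]) -> List[Region]:
    """Return the regions that are both writable and executable."""
    return [r for r in regions if "w" in r.perms and "x" in r.perms]


# Hypothetical sample, used when /proc is not available (non-Linux systems).
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/example
7ffc1a2b3000-7ffc1a2d4000 rw-p 00000000 00:00 0 [stack]
7f0000000000-7f0000001000 rwxp 00000000 00:00 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as maps_file:
            text = maps_file.read()
    except OSError:
        text = SAMPLE
    for region in writable_and_executable(parse_maps(text)):
        print("W+X region: %x-%x %s" % (region.start, region.end, region.path))
```

On a system with address space randomization enabled, running the script several times and printing the [stack] region should also show a different base address on each run.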
We mentioned that there were some legitimate uses for memory regions with both write and exec permissions. When does that happen, and what can be done about it? The most common case is on-the-fly code generation for optimization purposes. This use case is applicable to those using Java and its JIT (Just-In-Time) compiler.

The good news is that GRSEC lets you flag some specific executables and allows them to write to their code region or execute their data region. This reduces the security for those specific processes, but there are benefits. To exploit a bug, there has to be a bug in e.g. the JVM itself, not in your program. Bugs in the JVM are likely to be found and fixed much faster than bugs in your own program. This is not a comment about the quality of anyone’s code. It’s more about the number of users in the Java community and their scrutiny on the JVM.

Audit Suspicious Activity
Another interesting security feature of GRSEC is the ability to log some specific events. For instance, it is possible to make a record in the kernel log each time a process is terminated by SIGSEGV, a.k.a. Segmentation Fault.

What’s the point? Potential attackers will likely run a number of known exploits in an attempt to gain escalated privileges. Many of the exploits will hopefully fail. Often, the failure will result in the process causing a segmentation violation, and then being killed by SIGSEGV. The kernel logs can then be analyzed in real time, and suspicious patterns can be detected. This allows you to lock out malicious users, or, alternatively, monitor them closely to see what they’re doing.

Any C programmer will tell you that there are legitimate cases where programs are terminated by SIGSEGV. But if the system detects many different programs started by the same user that are all being killed in the same way, then it is a telltale sign that someone is trying to break into the system. If you’re not familiar with those concepts, you can draw upon an analogy in which you observe scratches around a padlock. A few scratches on the surface won’t mean anything. But if you see the padlock full of dents, you can bet that someone is trying to pick it!

There are many other similar events that are logged by GRSEC. GRSEC can also be useful in forensics in case someone does successfully breach the system. GRSEC logs will record how they’ve exploited the system. Knowing how someone exploited the system can be a valuable tool for the person who is trying to close the security gap.
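The pattern described above (many different programs, started by the same user, all killed by SIGSEGV in a short time) can be sketched as a small log analysis. This is an illustration only: the CrashEvent record, the 60-second window and the threshold of three programs are assumptions, and the sketch does not parse the actual GRSEC log format.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class CrashEvent(NamedTuple):
    """Hypothetical, already-parsed record of one process killed by SIGSEGV."""
    timestamp: float  # seconds
    uid: int          # user who started the process
    program: str      # executable name


def suspicious_uids(events: Iterable[CrashEvent],
                    window: float = 60.0,
                    min_programs: int = 3) -> Set[int]:
    """Flag users with at least `min_programs` distinct programs crashing
    within any `window`-second interval."""
    by_uid = defaultdict(list)
    for event in events:
        by_uid[event.uid].append(event)

    flagged = set()
    for uid, crashes in by_uid.items():
        crashes.sort(key=lambda e: e.timestamp)
        start = 0
        for end in range(len(crashes)):
            # Shrink the window until it spans at most `window` seconds.
            while crashes[end].timestamp - crashes[start].timestamp > window:
                start += 1
            programs = {e.program for e in crashes[start:end + 1]}
            if len(programs) >= min_programs:
                flagged.add(uid)
                break
    return flagged
```

A single program crashing repeatedly (a plain bug, the “few scratches”) is not flagged; only a burst of distinct crashing programs from one user is.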
Compile-time Security Features
GRSEC also plays its part during the kernel compilation. It enables a compiler plugin, which will “constify” some kernel structures. It will automatically add the const keyword to all structures containing only function pointers (unless they have a special “non const” marker to evade the process). In other words, instead of being mutable by default unless marked const, function tables are now const by default, unless specified otherwise. Accordingly, attempts to modify function tables will be detected at compile-time. The rationale is to make sure that any code that manipulates a function table will be closely audited before the function table is marked “non const”.

Why the emphasis on function tables? Because if they can be breached, they are a convenient way for a potential attacker to jump to arbitrary code; recall the technique explained in the beginning of Episode 4! Marking those data structures as const helps at compile time, but also later when the kernel is running, because those data structures will be laid out in a memory region which will be made read-only by the memory management unit. This not only reduces exposure to attacks, but can also make it harder for successful attackers to cover up their tracks by hijacking existing function tables.

...And Many More
As told in the introduction, this is just a quick overview. If you want to learn about other features, you can check GRSEC’s website. If you want to quench your thirst for technical details, you can follow these four steps to get a full listing of all the GRSEC features and a description of each feature.
• Get the kernel sources
• Apply the GRSEC patch set
• Run make menuconfig
• Navigate to the compilation options related to GRSEC

Almost every feature of GRSEC can be enabled/disabled at compilation time, and therefore will be listed there. The Help provided with each compilation option is fairly informative.

In addition to GRSEC, dotCloud has built-in additional layers of security. Each service runs in its own container. The benefits of container isolation were explained in Episode 2 on cgroups. We do not allow dotCloud users to have root access. “No root access” means that users cannot SSH as root, cannot login as root, and cannot get a root shell through sudo. All processes run under a regular, non-privileged UID. Furthermore, SUID binaries are restricted to a set of well-known, well-audited programs, like ping.

Each of those security layers is strong. We believe that combining them together can provide a more than adequate level of security for massively scaled, multi-tenant platforms.
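One of the layers mentioned above, restricting SUID binaries to a short list of audited programs, lends itself to a simple audit. The sketch below is an assumption about how such a check could look, not a description of dotCloud’s tooling: it walks a directory tree and reports SUID files that are not in an allowlist. The function names and the example allowlist are hypothetical.

```python
import os
import stat
from typing import Iterable, List


def is_suid(mode: int) -> bool:
    """True if `mode` describes a regular file with the set-user-ID bit."""
    return stat.S_ISREG(mode) and bool(mode & stat.S_ISUID)


def unexpected_suid(root: str, allowed: Iterable[str]) -> List[str]:
    """Return SUID files under `root` whose paths are not in `allowed`."""
    allowed_paths = set(allowed)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_suid(mode) and path not in allowed_paths:
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical allowlist; adjust to the programs you have audited.
    for path in unexpected_suid("/usr/bin", ["/usr/bin/ping"]):
        print("unexpected SUID binary:", path)
```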
Episode 5: Distributed Routing
The dotCloud platform is powered by hundreds of servers, some of them running more than one thousand containers. The majority of these containers are HTTP servers and they handle millions of HTTP requests every day to power the applications hosted on our platform.
HTTP Routing Layer
All the HTTP traffic is bound to a group of special machines called the “gateways”. The gateways parse HTTP requests, and route them to the appropriate backends. When there are multiple backends for a single service, the gateways also deal with the load balancing and failover. Last but not least, the gateways also forward HTTP logs to be processed by the metrics cluster.

[Figure: visitors (technically HTTP clients), HTTP routing layer (load balancers), dotCloud platform, dotCloud app cluster]

This “HTTP routing layer”, as we call it, runs on an elastic number of dedicated machines. When the load is low, 3 machines are enough to deal with the traffic. When spikes or DoS attacks happen, we scale up to 6, 10, or even more machines, to ensure optimal availability and latency.

All HTTP requests are bound to the “HTTP routing layer”, which is a cluster of identical HTTP load balancers. Each time we create, update (e.g. scale), or delete an application on dotCloud, the configuration of those load balancers has to be updated. The “master source” for all the configuration is stored within a Riak cluster, working in tandem with a Redis cache. The configuration is modified using basic commands:
• Create an HTTP entry
• Add/remove a frontend (virtual host)
• Add/remove a backend (container)

The commands are passed through a ZeroRPC API. Each update done through the API propagates through the platform; in the next sections, we will see which mechanisms are used.
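To make the three configuration commands concrete, here is a minimal in-memory sketch of a routing table that supports them and picks a backend in round-robin order. It is an illustration under stated assumptions: the class and method names are hypothetical, and the real platform stores this data in Riak and Redis and exposes the commands through ZeroRPC, which is not shown here.

```python
from typing import Dict, List, Optional, Set


class RoutingTable:
    """Hypothetical in-memory model of the load balancer configuration."""

    def __init__(self) -> None:
        self._frontends: Dict[str, Set[str]] = {}   # entry -> virtual hosts
        self._backends: Dict[str, List[str]] = {}   # entry -> container addresses
        self._next: Dict[str, int] = {}             # entry -> round-robin cursor

    def create_entry(self, entry: str) -> None:
        self._frontends.setdefault(entry, set())
        self._backends.setdefault(entry, [])
        self._next.setdefault(entry, 0)

    def add_frontend(self, entry: str, host: str) -> None:
        self._frontends[entry].add(host.lower())

    def remove_frontend(self, entry: str, host: str) -> None:
        self._frontends[entry].discard(host.lower())

    def add_backend(self, entry: str, address: str) -> None:
        if address not in self._backends[entry]:
            self._backends[entry].append(address)

    def remove_backend(self, entry: str, address: str) -> None:
        if address in self._backends[entry]:
            self._backends[entry].remove(address)

    def route(self, host: str) -> Optional[str]:
        """Return the backend for the given Host header, or None."""
        host = host.lower()
        for entry, hosts in self._frontends.items():
            if host in hosts:
                backends = self._backends[entry]
                if not backends:
                    return None
                index = self._next[entry] % len(backends)
                self._next[entry] = index + 1
                return backends[index]
        return None
```

Scaling an application then amounts to one add_backend call per new container, and the next request for that virtual host can already be routed to it.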
Version 1: Nginx + ZeroRPC
As you probably know, a start-up must be lean, agile, and many other things. It also needs to be pragmatic, and the right solution is not always the ideal one, but it’s the one that allowed us to ship on time. That’s why the first iteration of our routing layer had some shortcomings, as we will see. But it functioned properly up to the point of supporting tens of thousands of apps.

Nginx powered the first version of dotCloud’s routing layer. Each modification to an app caused the central “vhosts” service to push the full configuration to all the load balancers, using ZeroRPC. Obviously, as the number of apps grew, the size of the configuration grew as well. Sending differential updates would have been better. But at least, when a load balancer lost a few configuration messages, there was no special case to handle. The next update would contain the full configuration, and provide all the necessary information.

The configuration was transmitted using a compressed, efficient format. Then, each load balancer would transform this abstract configuration into the Nginx configuration file, and inform Nginx to reload this configuration. Nginx is well designed: even when loading the new configuration, it can still serve requests with the old one, which meant that no HTTP request was lost during the configuration update.

Nginx also handles load balancing and fail-over well. When a backend server dies, Nginx detects it, removes it from the pool, periodically tries it again, and will re-add it to the pool once it has fixed itself.

This setup had two issues:
• Nginx does not support the WebSocket protocol, which was one of the top features requested by our users at that time
• Nginx has no support for dynamic reconfiguration, which means that each configuration update requires the whole configuration file to be regenerated and reloaded

At some point, the load balancers started to expend a significant amount of CPU time to reload Nginx configurations. There was no significant impact on running applications, but it required deploying more and more powerful instances as the number of apps increased. Although Nginx was still fast and efficient, we had to find a more dynamic alternative.
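The step where each load balancer turns the abstract configuration into an Nginx configuration file can be sketched as follows. This is a simplified assumption about the shape of the data, not the original dotCloud code: the dictionary layout, the render_nginx name and the app_ prefix are hypothetical, while upstream, server, server_name, listen and proxy_pass are standard Nginx directives. The output is a fragment meant to be included inside an http block.

```python
from typing import Dict, List


def render_nginx(config: Dict[str, Dict[str, List[str]]]) -> str:
    """Render one upstream block and one server block per application.

    `config` maps an app name to {"frontends": [...], "backends": [...]}.
    Apps without frontends or backends are skipped, since Nginx rejects
    an upstream block that contains no server.
    """
    blocks = []
    for name in sorted(config):
        app = config[name]
        if not app["frontends"] or not app["backends"]:
            continue
        upstream = "app_" + name
        lines = ["upstream %s {" % upstream]
        lines += ["    server %s;" % backend for backend in app["backends"]]
        lines += [
            "}",
            "server {",
            "    listen 80;",
            "    server_name %s;" % " ".join(app["frontends"]),
            "    location / {",
            "        proxy_pass http://%s;" % upstream,
            "        proxy_set_header Host $host;",
            "    }",
            "}",
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    example = {
        "myapp": {
            "frontends": ["www.example.com"],
            "backends": ["10.0.0.1:8080", "10.0.0.2:8080"],
        },
    }
    print(render_nginx(example))
```

After writing the rendered file, a reload can be requested with nginx -s reload. Because the whole file is regenerated on every change, the cost of each update grows with the total number of apps, which is the scaling issue described above.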
Version 3: Active Health Checks
Hipache has a simple embedded health-check system. When a request fails because of some backend issue (TCP errors, HTTP 5xx responses, etc.), the backend is flagged as being dead, and remains in this state for 30 seconds. During the 30 seconds, no request is sent to the backend; then it goes back to normal state. However, if it is still faulty, it will immediately be re-flagged as dead. This mechanism is simple enough and it works, but it has three caveats:
• If a backend is frozen, we will still send requests to it, until it gets marked as dead
• When a backend is repaired, it can take up to 30 seconds to mark it live again
• A backend which is permanently dead will still receive a few requests every 30 seconds

To address those three problems, we implemented active health checks. The health checker permanently monitors the state of the backends, by doing the HTTP equivalent of a simple “ping”. As soon as a backend stops replying correctly to ping requests, it is marked as dead. As soon as it starts replying again, it is marked as live. The HTTP “pings” can be sent every few seconds, meaning that it will be much faster to detect when a backend changes.

To implement the active health checker, we considered multiple solutions: Node.js, Python+gevent, and Twisted. We finally decided to roll it with the Go language. Go was chosen for several reasons:
• The health checker is massively concurrent (hundreds, and even thousands of HTTP connections can be “in flight” at a given time)
• Go programs can be compiled and deployed as a single, stand-alone binary
• We have other tools doing massively concurrent queries, and this was an excellent occasion to do some comparative benchmarks (we will be publishing the benchmarks in future eBooks)

The active health checker is completely optional. You don’t need it to run Hipache, and you can plug it on top of an existing Hipache installation without modifying the Hipache configuration: it will detect and update the Hipache configuration directly through the Redis used by Hipache itself. In other words, it gets along perfectly fine with Hipache’s embedded passive health-checking system, and running it will just improve the dead backend detection. And of course, hchecker is open source, just like Hipache.
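The passive mechanism and its 30-second window can be sketched in a few lines. This is a simplified model for illustration, not Hipache’s or hchecker’s actual code: the PassiveHealth class, its method names and the injectable clock are assumptions. The mark_live method stands in for what an active checker would do when a backend answers a ping again.

```python
import time
from typing import Callable, Dict, List

DEAD_TTL = 30.0  # seconds a backend stays flagged as dead, as described above


class PassiveHealth:
    """Hypothetical model of passive health checking with a fixed TTL."""

    def __init__(self, ttl: float = DEAD_TTL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._dead_until: Dict[str, float] = {}

    def report_failure(self, backend: str) -> None:
        """Called when a proxied request fails (TCP error, HTTP 5xx)."""
        self._dead_until[backend] = self._clock() + self._ttl

    def mark_live(self, backend: str) -> None:
        """Called by an active checker when a ping succeeds again."""
        self._dead_until.pop(backend, None)

    def is_dead(self, backend: str) -> bool:
        deadline = self._dead_until.get(backend)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            # TTL expired: the backend is retried on the next request.
            del self._dead_until[backend]
            return False
        return True

    def usable(self, backends: List[str]) -> List[str]:
        """Backends that may currently receive requests."""
        return [b for b in backends if not self.is_dead(b)]
```

The three caveats are visible in the model: a failure is only recorded after a real request has failed, a repaired backend stays excluded until the TTL expires unless mark_live is called, and a permanently dead backend is retried each time its TTL runs out.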
What’s next?
Since this HTTP routing layer is a major part of the dotCloud infrastructure, we’re constantly trying to find ways to make it better. Recently we did some research and tests to see if there was some way to implement dynamic routing with Nginx. Guess what: we found something!

Less than one year ago, when we started to think about the design of Hipache and began implementation, we looked at the Nginx Lua module. It has improved a lot since then and it may be an ideal candidate. In fact, we aimed for an even higher goal. We wanted to route requests with Nginx, using configuration rules stored in Redis, using the format currently used by Hipache. This would allow us to re-use many components such as the Redis feeder and the active health checker that uses the same configuration format.

We started an experimental project which lets Nginx mimic Hipache, by using the same Redis configuration format. In this open source project, called hipache-nginx, Nginx deals with the request proxying, while the routing logic is all in Lua. We used the excellent lua-resty-redis module to talk to Redis from Nginx.

Some preliminary benchmarks show that under high load, hipache-nginx can be 10x faster than the original Hipache in Node.js. The benchmarks have to be refined, but it appears that hipache-nginx can deliver the same performance as hipache-nodejs with 10x fewer resources. So, while the code is still experimental, it shows that there is plenty of room for improvement in the current HTTP routing layer. Even if it will probably have an effect on apps with 10,000-100,000 requests per second, it is still worth investigating.
CONCLUSION
As you can see, building a PaaS like dotCloud or Heroku involves specific knowledge about fundamental technologies. We aim to expose the underlying technologies that we’ve implemented that provide isolation between apps, rapid deployment, protection against security threats and distributed routing.

Of course, you may not choose to implement any of the specific technologies that we’ve implemented in dotCloud. In other words, if you are serious about building a robust platform, you may want to become familiar with those types of technologies. Or, alternatively, you could rely on an existing proven platform like dotCloud.

Join dotCloud’s Technical Community
Sign up for your own account
Join the technical discussions in our open forums
Read our blog
Have a technical question? Email us: support@dotcloud.com
Author’s Biography

Jérôme Petazzoni, PaaS Under the Hood, Episodes 1-4
Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned the nickname of “master Yoda”. In a previous life he built and operated large scale Xen hosting back when EC2 was just the name of a plane, supervised the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, and performed various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials and sample applications. He’s also an avid dotCloud power user who has deployed just about anything on dotCloud. Look for one of his many custom services on our Github repository.
Connect with Jérôme on Twitter! @jpetazzo

Sam Alba, PaaS Under the Hood, Episode 5
As dotCloud’s first engineering hire, Sam was part of the tiny team that shipped our first private beta in 2010. Since then, he has been instrumental in scaling the platform to tens of millions of unique visitors for tens of thousands of developers across the world, leaving his mark on every major feature and component along the way. Today, as dotCloud’s first Director of Engineering, he manages our fast-growing engineering team, which is another way to say he sits in meetings so that the other engineers don’t have to. When not sitting in a meeting, he maintains several popular open source projects, including Hipache and Cirruxcache and other projects also ending in “-ache”. In a previous life, Sam supported Fortune 500s at Akamai, built the web infrastructure at several startups, and wrote software for self-driving cars in a research lab at INRIA.
Follow Sam on Twitter @sam_alba