
A deep dive into Linux namespaces

Ifeanyi Ubah KTH '17. SE and Distributed Systems Github

Overview

• What is a namespace?
• What is it good for?
• Unsharing is caring
• But like, what is a container though?
• Isolate
• Implementation

In this series of posts we will look closely at one of the main ingredients in a container -
namespaces. In the process, we will create a simpler clone of the docker run command - our
very own program that takes a command (along with its arguments, if any) as input and
spins up a container process to run it, isolated from the rest of the system, similar to how you
would docker run it from an image.

What is a namespace?
A Linux namespace is an abstraction over resources in the operating system. We can think of
a namespace as a box. Inside this box are system resources; exactly which ones depends
on the box’s (namespace’s) type. At the time of writing there are 7 types of namespaces: Cgroup,
IPC, Network, Mount, PID, User and UTS (Linux 5.6 has since added an eighth, Time).

For instance, the Network namespace encapsulates system resources related to networking
such as network interfaces (e.g. wlan0, eth0), route tables, etc.; the Mount namespace
encapsulates files and directories in the system; PID contains process IDs; and so on. So two
instances of a Network namespace, A and B (corresponding to two boxes of the same type in our
analogy), can contain different resources - maybe A contains wlan0 while B contains eth0 and a
different copy of the route table.

Namespaces aren’t some add-on feature or library that you need to apt install; they are
provided by the Linux kernel itself and are already a prerequisite for running any process on the
system. At any given moment, any process P belongs to exactly one instance of each
namespace type - so when it needs to, say, update the route table on the system, Linux shows
it the copy of the route table of the namespace to which it belongs at that moment.

What is it good for?


Absolutely nothi… just kidding. One good thing with boxes is that you can add and remove
stuff from one box and it will not affect the content of other boxes. That’s the same idea here
with namespaces - a process P can go crazy and sudo rm -rf /, but another process Q that
belongs to a different Mount namespace can be unaffected, provided the two namespaces
have different filesystems mounted at those paths.

Note, though, that the fact that a resource is encapsulated within a namespace doesn’t
necessarily mean that it’s a unique copy. In a number of cases, either by design or as a security
hole, two or more namespaces will contain the same copy, e.g. of the same file, so that changes
made to that file in one Mount namespace will in fact be visible in all other Mount namespaces
that also reference it. For this reason, we will retire our box analogy here since an item cannot
simultaneously exist in two distinct boxes.

Unsharing is caring
We can see the namespaces that a process belongs to! In typical Linux fashion, they’re
exposed as files under the directory /proc/$pid/ns for a given process with process id $pid:
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 net -> net:[4026531957]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 pid -> pid:[4026531836]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 user -> user:[4026531837]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 uts -> uts:[4026531838]

You can open a second terminal and run the same command and it should give you the exact
same output - this is because, as we mentioned earlier, a process must belong to one
namespace of each type and, unless we explicitly specify which ones, Linux adds it as a member
of the default namespaces (in practice, it inherits the namespaces of its parent process).

Let’s meddle in this a bit. In the second terminal we can run something like:
$ hostname
iffy
$ sudo unshare -u bash
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 root root 0 May 18 13:04 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 May 18 13:04 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 May 18 13:04 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 May 18 13:04 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 May 18 13:04 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 May 18 13:04 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 May 18 13:04 uts -> uts:[4026532474]
$ hostname
iffy
$ hostname coke
$ hostname
coke

The unshare command runs a program (optionally) in a new namespace. The -u flag tells it to
run bash in a new UTS namespace. Notice how our new bash process points to a different uts
file while all others remain the same.

Creating new namespaces usually requires superuser access. From now on, we
will assume that both unshare and our implementation are run with sudo.

One implication of what we just did is that we can now change the system’s hostname from
within our new bash process and it won’t affect any other process in the system. You can
verify this by running hostname in the first shell or a new one and seeing that the hostname
hasn’t changed there.
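What unshare -u does on the command line can also be done with the unshare(2) system call. The sketch below is an illustration under assumptions, not code from isolate: the function name set_hostname_in_new_uts is hypothetical, and the work is done in a forked child so that the caller's own UTS namespace is never touched. Without root (more precisely, without CAP_SYS_ADMIN) the unshare call is expected to fail, which the function reports as 1.

```c
#define _GNU_SOURCE /* for unshare() and CLONE_NEWUTS */
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, move it into a new UTS namespace
 * with unshare(CLONE_NEWUTS) and set its hostname to `name`.
 * Returns 0 if the child succeeded, 1 if the child could not do it
 * (typically for lack of privileges), -1 if fork or waitpid failed.
 */
int set_hostname_in_new_uts(const char *name)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same effect as `unshare -u`, for this process only. */
        if (unshare(CLONE_NEWUTS) != 0)
            _exit(1);
        /* The new hostname is visible only inside the new namespace. */
        if (sethostname(name, strlen(name)) != 0)
            _exit(1);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Whatever the return value, a gethostname call in the calling process gives the same answer before and after, which is the isolation property demonstrated in the terminal session above.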

But like, what is a container though?


Hopefully, you now have some idea of what a namespace can do. You might guess that
containers are fundamentally ordinary processes with different namespaces from other
processes, and you’d be correct. In fact, a quote-unquote container doesn’t have to belong to a
unique namespace for each type - it can share some of them.

For instance, when you docker run --net=host redis, all you do is tell docker not to create a
new Network namespace for the redis process and, as we saw, Linux will add that process as a
member of the default Network namespace just like every other regular process. So the redis
process is exactly like everyone else from a networking perspective. Networking isn’t special
here; docker run lets you do this customization for most namespaces. This raises the question
of what a container even is. Is a process that shares all but one namespace still a container?
¯\_(ツ)_/¯
Usually containers come with the notion of isolation, achieved through namespaces - the
smaller the number of namespaces and resources that a process shares, the more isolated
the process is and that’s all that really matters.
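One way to check how much two processes share is to compare their entries under /proc/<pid>/ns: two links refer to the same namespace when the device and inode numbers behind them match. The sketch below assumes /proc is mounted and that the caller is allowed to inspect both processes; same_namespace and shares_with_parent are hypothetical names chosen for this example.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether processes a and b are members of
 * the same namespace of the given type ("net", "uts", "mnt", ...).
 * Returns 1 if they share it, 0 if they do not, and -1 if either entry
 * cannot be inspected (no such process, or insufficient permissions).
 */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the symlink to the namespace object itself. */
    if (stat(path_a, &st_a) != 0 || stat(path_b, &st_b) != 0)
        return -1;

    /* Same device and same inode means the same namespace instance. */
    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}

/* Convenience wrapper: does the caller share `type` with its parent? */
int shares_with_parent(const char *type)
{
    return same_namespace(getpid(), getppid(), type);
}
```

For a process started with docker run --net=host, a call such as same_namespace(container_pid, 1, "net") made as root on the host would be expected to return 1, while the same call for "mnt" or "pid" would be expected to return 0.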

Isolate
In the remainder of this post, we will lay the groundwork for our program that we will call
isolate. isolate takes a command as arguments and runs that command in a new process
isolated from the rest of the system and within its very own namespaces. In the coming
posts, we will look at adding support for individual namespaces when isolate spins up the
command process.

In terms of scope, we will focus on the User, Mount, PID and Network namespaces. The rest are
relatively trivial to implement once we’re done (in fact, we add UTS support in the initial
implementation here) and Cgroup for example is only interesting from a perspective that is
out of scope of this series (studying cgroups - the other ingredient in containers that is used
to control how much of a resource a process is allowed to use).

Namespaces can get complex real quick so there are lots of different paths we can take while
studying each namespace but we can’t take them all. We will only discuss the paths that are
relevant to the program that we’re building. Each post will start off with some experiments
on the namespace in question within a terminal in an attempt to understand the interactions
involved in setting up that namespace. After this we will already have an idea of what we
want to accomplish and will then follow up with a corresponding implementation in isolate.

To avoid bombarding the posts with code, we will not include things like helper
functions that are not necessary to understand the implementation. You can find
the full source code here on Github.

Implementation
The source code for this post can be found here.

Our isolate implementation will initially be a simple program that takes a command path
from its command-line arguments and clones a new process that executes the command with the
specified arguments. The cloned command process will run in its own UTS namespace just like we did
with unshare earlier. In later posts, we will see that namespaces do not necessarily work (or
even provide isolation) out of the box and we will need to do some setup after creating them
(but before executing the actual command) in order for the command to truly run in
isolation.

This namespace creation-setup combo will require some co-operation between the main
isolate process and the child command process. As a result, part of the groundwork here
will be to set up a communication channel between both processes - we will use a Linux pipe
due to its simplicity given our use case.

We have three things to do:

1. Create the main isolate process that reads the command from its arguments.
2. Clone a new process that will run the command in a new UTS namespace.
3. Set up a pipe so that the command process begins the command execution only after it
receives a signal from the main process that the namespace setup is done.

Here is the main process:


int main(int argc, char **argv)
{
    struct params params;
    memset(&params, 0, sizeof(struct params));
    parse_args(argc, argv, &params);

    // Create pipe to communicate between main and command process.
    if (pipe(params.fd) < 0)
        die("Failed to create pipe: %m");

    // Clone command process.
    int clone_flags = SIGCHLD | CLONE_NEWUTS;
    int cmd_pid = clone(cmd_exec, cmd_stack + STACKSIZE, clone_flags, &params);

    if (cmd_pid < 0)
        die("Failed to clone: %m\n");

    // Get the writable end of the pipe.
    int pipe_fd = params.fd[1];

    // Some namespace setup will take place here ...

    // Signal to the command process we're done with setup.
    if (write(pipe_fd, "OK", 2) != 2)
        die("Failed to write to pipe: %m");
    if (close(pipe_fd))
        die("Failed to close pipe: %m");

    if (waitpid(cmd_pid, NULL, 0) == -1)
        die("Failed to wait pid %d: %m\n", cmd_pid);

    return 0;
}

Check out the clone_flags that we pass to our clone invocation. See how dead simple it is to
create a new process in its own namespace? All we have to do is set the flag for the
namespace type (the CLONE_NEWUTS flag corresponds to the UTS namespace) and Linux takes
care of the rest.
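The main listing leans on a few names that this post leaves out: struct params, parse_args, die, cmd_stack and STACKSIZE. The following is a guess at what they could look like, inferred only from how they are used in the listing, and should not be read as the author's code. The stack size in particular is an assumed value, and die relies on glibc's %m conversion to print the errno message.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed stack size for the cloned child; the real value is not shown. */
#define STACKSIZE (1024 * 1024)

/* clone() is handed cmd_stack + STACKSIZE because the stack grows down. */
char cmd_stack[STACKSIZE];

/* State shared between the main process and the command process. */
struct params {
    int fd[2];   /* Pipe: fd[0] is the read end, fd[1] the write end. */
    char **argv; /* Command followed by its arguments, NULL-terminated. */
};

/* Print a printf-style message to stderr and exit with a failure status. */
void die(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap); /* glibc expands %m to strerror(errno). */
    va_end(ap);
    exit(EXIT_FAILURE);
}

/* Everything after the program name is the command and its arguments. */
void parse_args(int argc, char **argv, struct params *params)
{
    if (argc < 2)
        die("Usage: %s COMMAND [ARGS...]\n", argv[0]);
    params->argv = argv + 1;
}
```

With these definitions, running ./isolate sh -c hostname would leave params.argv pointing at "sh", "-c", "hostname", which is the array that the command process later hands to execvp.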

Next, the command process waits for a signal before exec-ing the command:
static int cmd_exec(void *arg)
{
    // Kill the cmd process if the isolate process dies.
    if (prctl(PR_SET_PDEATHSIG, SIGKILL))
        die("cannot PR_SET_PDEATHSIG for child process: %m\n");

    struct params *params = (struct params*) arg;

    // Wait for 'setup done' signal from the main process.
    await_setup(params->fd[0]);

    char **argv = params->argv;
    char *cmd = argv[0];
    printf("===========%s============\n", cmd);

    if (execvp(cmd, argv) == -1)
        die("Failed to exec %s: %m\n", cmd);

    die("¯\\_(ツ)_/¯");
    return 1;
}
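The await_setup helper called above is not shown in this post. A minimal sketch of the idea, assuming the two-byte "OK" message that the main process writes, could look like the following. The name await_setup_sketch and its int return value are choices made for this illustration; the real helper may well behave differently, for example by calling die on failure.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for await_setup: block until the two-byte "OK"
 * message arrives on the read end of the pipe.
 * Returns 0 once "OK" has been read, and -1 on a read error, on end of
 * file before two bytes arrive, or on any other message.
 */
int await_setup_sketch(int read_fd)
{
    char buf[2];
    size_t got = 0;

    while (got < sizeof(buf)) {
        ssize_t n = read(read_fd, buf + got, sizeof(buf) - got);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* Interrupted by a signal: retry the read. */
            return -1;
        }
        if (n == 0)
            return -1; /* Every write end was closed before "OK" arrived. */
        got += (size_t)n;
    }
    return memcmp(buf, "OK", sizeof(buf)) == 0 ? 0 : -1;
}
```

Because read blocks on an empty pipe, the command process stays parked in this function until the main process has finished its setup and written the message, which is the ordering that step 3 asks for.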

Finally we can try it out:


$ ./isolate sh
===========sh============
$ ls
isolate isolate.c isolate.o Makefile
$ hostname
iffy
$ hostname coke
$ hostname
coke
# Verify in a new terminal that hostname hasn't been updated

Currently, isolate is a little bit more than a program that just forks off a command (we do
have the UTS thing going for us). In the next post, we take it a step further by looking at User
namespaces and have isolate run the command in its own User namespace. There, we will
see that we actually need to do some work in order to have a usable namespace in which the
command can run.
