How to reinvent containers (with one weird trick)

Posted by jspc

Containers are at once both simple and difficult to grok. One lunchtime, while trying to understand the internals of docker, I accidentally created a container standard which must never, ever see the light of production. I called it Navvy.

Linux containers are, in their purest form, made up of three technologies: chroot, cgroups and network namespaces. Some of these technologies have been around since year dot. Some of these technologies are from the last couple of years.

The toolset

chroot jails

chroot as a concept/system call initially appeared in V7 Unix and was first widely used in 4.2BSD in 1983. The system call changes a process’ apparent root directory. To the process, however, the root directory still appears to be /. This solves two problems:

  1. Isolating installations of files to preserve the Filesystem Hierarchy Standard
  2. Preventing a process accessing files outside of its scope

chroot is commonly used in systems maintenance and bootstrapping distributions such as gentoo (because, let’s face it, if you’re getting deep into the internals of containerisation you almost certainly have an old tiny computers pentium III box somewhere with an ancient gentoo installation on it).

cgroups

cgroups are cool. In 2006 a pair of Google engineers started a project called process containers with the purpose of throttling processes at the kernel level. The name was shortly after changed to Linux Control Groups (cgroups) because of the confusion and ambiguity around the word container in kernel-space.

cgroups allow us to limit the resources a process can use. Because UNIX processes are hierarchical, children of the cgroup’d process are in the same cgroup and will contribute to the same limit.
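
The inheritance is easy to see from any Linux shell. What follows is a minimal sketch, assuming /proc is mounted: it compares the cgroup membership of a process started by the current shell with that of a process started one generation further down. The variable names are made up for illustration.

```shell
# Cgroup membership of a process started directly by this shell...
parent_view="$(cat /proc/self/cgroup)"
# ...and of a process started by a child shell, one generation further down.
child_view="$(sh -c 'cat /proc/self/cgroup')"

if [ "$parent_view" = "$child_view" ]; then
    echo "child inherited the parent's cgroups"
else
    echo "child is in a different cgroup"
fi
```

Unless something moves the child between the fork and the read, both views should match, which is exactly why a limit set on one process ends up covering everything it spawns.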

network namespaces

network namespaces are, in effect, another distinct copy of the network stack; they have their own routes, firewall rules, devices and magic. This allows for several interesting things:

  1. Separating traffic destined for one container from others
  2. Traffic shaping at the container level
  3. The ability to route processes differently in different circumstances
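
A quick way to tell which network stack a process is using is to look at its namespace handle under /proc. The sketch below assumes a Linux /proc; netns_id is a hypothetical helper name, not part of any tool.

```shell
# netns_id: print the identifier of the network namespace the caller lives in.
# (netns_id is a hypothetical helper name.)
netns_id() {
    readlink /proc/self/ns/net
}

here="$(netns_id)"
child="$(sh -c 'readlink /proc/self/ns/net')"
echo "this shell: $here"
echo "child proc: $child"
```

The value has the form net:[inode]. Both lines should show the same inode, because children share their parent’s namespace by default; the same readlink run inside a separately created namespace would show a different one.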

Linux has had this technology as part of netns (another Google project) since 2.6.27, which makes network namespaces the youngest member of the container toolchain.

Containerising

In order to play with containers we need an operating system to run inside one. The following examples use a CentOS 7 container running on Arch Linux. The best way to start is to unpack the filesystem from a livecd into a directory:

 $ mkdir centos7 squashfs root navvy-1
 $ wget -q http://centos.openitc.uk/7.0.1406/isos/x86_64/CentOS-7.0-1406-x86_64-DVD.iso
 $ sudo mount -o loop,ro CentOS-7.0-1406-x86_64-DVD.iso centos7/
 $ sudo mount -o loop,ro centos7/LiveOS/squashfs.img squashfs/
 $ sudo mount -o loop,ro squashfs/LiveOS/rootfs.img root/
 $ sudo cp -a root/. navvy-1/
 $ sudo umount root/
 $ sudo umount squashfs/
 $ sudo umount centos7/
 $ rm -rf centos7 squashfs root CentOS-7.0-1406-x86_64-DVD.iso

We can use this version of CentOS 7 now by issuing a simple

 $ sudo chroot navvy-1/ /bin/bash

But this will, essentially, mean:

  1. We could run stuff here that steals resources from other processes on this machine
  2. We’re sharing the entirety of the network

Instead we’re going to set up some cgroup config around memory and CPU. We do this with:

 $ sudo cgcreate -a jspc -g memory,cpu:navvy-1

Which gives us, for instance:

 $ ls -l /sys/fs/cgroup/memory/navvy-1/
total 0
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 cgroup.clone_children
--w--w--w- 1 jspc root 0 Mar 12 15:13 cgroup.event_control
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 cgroup.procs
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 memory.failcnt
--w------- 1 jspc root 0 Mar 12 15:13 memory.force_empty
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 memory.limit_in_bytes
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 memory.max_usage_in_bytes
-rw-r--r-- 1 jspc root 0 Mar 12 15:13 memory.move_charge_at_immigrate

We’re going to set a limit of 1GB of memory and roughly 10% of the CPU. It might appear that a limit like this could be dodged by spawning many processes inside the container, each with its own 1GB allowance. It can’t: cgroup limits apply to the group as a whole, not to each process. Processes inherit cgroup membership and namespaces from their parent; because init generally runs with All The Resources available, nothing is throttled until we say so. When we put the chroot process under a cgroup we limit the pool that both it and all of its children share. Thus 1GB and 10% for the whole container.

 $ echo 1000000000 > /sys/fs/cgroup/memory/navvy-1/memory.limit_in_bytes
 $ echo 102 > /sys/fs/cgroup/cpu/navvy-1/cpu.shares

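n.b.: cpu.shares is a relative weight, with 1024 as the default for each group, and shares cascade down the hierarchy. Because navvy-1 is a top-level cgroup, 102 equates to ca. 10% of a default group’s weight; the ratio only bites when the CPU is contended.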
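
The arithmetic behind that figure is simple enough to check. The sketch below uses a hypothetical helper, share_pct, which is not part of any cgroup tooling, and assumes awk is installed.

```shell
# share_pct SHARES TOTAL: print SHARES as a percentage of TOTAL, to one decimal place.
# (share_pct is a hypothetical helper, not part of any cgroup tooling.)
share_pct() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.1f\n", s * 100 / t }'
}

# 102 shares measured against the default weight of 1024.
share_pct 102 1024

# 102 shares competing with one busy sibling group that kept the default 1024.
share_pct 102 $((102 + 1024))
```

This prints 10.0 and then 9.1: the first is the "ca. 10%" quoted above, the second is roughly what navvy-1 could expect if exactly one other default-weight group were hammering the CPU at the same time.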

To chroot with this limit we simply run our command as per:

 $ sudo cgexec -g memory,cpu:navvy-1 chroot navvy-1/ /bin/bash

The final task is to create a new network stack:

 $ sudo ip netns add navvy-1
 $ sudo ip link add veth0 type veth peer name veth1
 $ sudo ip link set veth1 netns navvy-1
 $ sudo ip netns exec navvy-1 ifconfig veth1 10.10.10.10/24 up

All that’s left is to bridge veth0 to wherever your network lives (which will vary per platform).
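
As one possible shape for that last step, here is a sketch that creates a Linux bridge and attaches the host end of the veth pair to it. The bridge name br0 and the helpers run and attach_veth are assumptions for illustration; hooking the bridge up to a physical interface is left out because that is the part that varies per platform. With DRY_RUN=1 the commands are only printed, so it can be tried without touching the host’s network.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
# (run and attach_veth are hypothetical helpers.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# attach_veth BRIDGE IFACE: create BRIDGE, attach IFACE to it and bring both up.
attach_veth() {
    run sudo ip link add name "$1" type bridge
    run sudo ip link set "$2" master "$1"
    run sudo ip link set "$1" up
    run sudo ip link set "$2" up
}

# Preview the commands; br0 is an assumed bridge name.
DRY_RUN=1
attach_veth br0 veth0
```

Setting DRY_RUN=0 (or leaving it unset) runs the four ip commands for real.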

We can then finally use our container as per:

 $ sudo ip netns exec navvy-1 cgexec -g memory,cpu:navvy-1 chroot navvy-1/ /bin/bash
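
Since the container name shows up three times in that command, it is tempting to wrap it. The sketch below only builds and prints the command line; navvy_cmd is a hypothetical helper, and it assumes the namespace, the cgroup (with the memory and cpu controllers it was created with) and the directory all share one name, as they do in this post.

```shell
# navvy_cmd NAME [COMMAND...]: print the command line that enters container NAME.
# (navvy_cmd is a hypothetical helper; it prints the command rather than running it.)
navvy_cmd() {
    name="$1"
    shift
    # Default to an interactive shell when no command is given.
    [ "$#" -gt 0 ] || set -- /bin/bash
    echo "sudo ip netns exec $name cgexec -g memory,cpu:$name chroot $name/ $*"
}

navvy_cmd navvy-1
navvy_cmd navvy-1 /bin/ls /
```

Printing rather than executing keeps the helper harmless; piping its output to sh would actually launch the container.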

But what does it all mean?

Containers are conceptually simple to grok; they fence off memory, CPU and networking, and jail a process to its own set of binaries. Easy. In practice, though, they can get difficult to run outside of an all-encompassing tool such as docker or even the CoreOS offering, Rocket.

Ultimately, though, through understanding the underlying technologies and toolchains, and how they control our containers, we can start to piece together interesting solutions to problems. Having access to the cgroup layer, for instance, means we could scale an application’s access to memory up or down as we go along, or even start routing an application’s traffic via another host to start snarfing data.
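
Scaling memory as we go along amounts to rewriting the same file we wrote the original limit to. The sketch below rehearses that against a scratch directory so it can be run without root; set_mem_limit and the CGROUP_ROOT override are assumptions for illustration, and on a real host the default path under /sys/fs/cgroup would be used instead.

```shell
# set_mem_limit NAME BYTES: rewrite the memory limit of cgroup NAME.
# CGROUP_ROOT is a hypothetical override so this can be tried against a scratch directory.
set_mem_limit() {
    limit_file="${CGROUP_ROOT:-/sys/fs/cgroup}/memory/$1/memory.limit_in_bytes"
    echo "$2" > "$limit_file"
}

# Rehearse against a temporary directory standing in for the cgroup filesystem.
CGROUP_ROOT="$(mktemp -d)"
mkdir -p "$CGROUP_ROOT/memory/navvy-1"

set_mem_limit navvy-1 1000000000   # the original 1GB
set_mem_limit navvy-1 2000000000   # scale up to 2GB while the container runs
cat "$CGROUP_ROOT/memory/navvy-1/memory.limit_in_bytes"
```

The final cat prints 2000000000. Against the real cgroup filesystem, raising a limit should take effect immediately, whereas lowering it below what the group is already using may be refused by the kernel.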