Some Container History
Virtual machines, which emulate hardware and allow operating systems to run inside one another, have existed for over 50 years. They’re great for hosting heterogeneous systems, but carry a performance overhead.
Docker started getting traction in 2013, allowing the easy creation of environments for isolating processes while still using the same Linux kernel – less overhead than VMs, as long as you’re using Linux!
Today there’s a large ecosystem around container runtimes, orchestration and other tooling – see more at https://landscape.cncf.io/
Core Isolation Technologies
Be it Docker with its long-running daemon, or something conceptually simpler like podman or runc, containers are built on two pieces of technology that permit isolation while using the same kernel – Linux namespaces and cgroups. Container runtimes with ‘heavier’ isolation, like Kata Containers, are exceptions.
Container runtimes use namespaces for each container to create partitions of global resources, allowing processes to execute without being able to influence each other. We’ll look at the mount and net namespaces – the others will be left as an exercise for the reader!
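To make the idea of per-process namespaces a little more concrete, here is a minimal Python sketch that lists the namespaces the current process belongs to. It assumes a Linux host with /proc mounted, and the function name list_namespaces is made up for illustration. Two processes share a namespace when the identifiers printed for that namespace match.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace name (e.g. 'mnt', 'net') to its identifier link target."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, no such process, or /proc is not mounted
    for name in names:
        try:
            # Each entry is a symlink whose target names the namespace instance.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable; skip it
    return result


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Running this once on the host and once inside a container should show different identifiers for the namespaces the runtime has unshared.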
The mount namespace may feel familiar to users of chroots (which present a Linux directory as a root filesystem). The key practical differences between mount namespaces and chroots are that a) namespaces are cleaned up automatically once nothing is using them any more (worth the price of entry in its own right in the opinion of the author!) and b) mounts made inside a mount namespace can be hidden from the host.
Container runtimes use this namespace to set up a container’s filesystem and volumes. Much of its power comes from general filesystem mounting functionality in Linux – be it mounting a read-only squashfs as the root with a bind-mounted writable directory as /home, or using a layered filesystem (such as overlayfs) to recreate Docker-style image layers.
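As a rough sketch of the layered-filesystem case, the Python below builds the mount command for an overlayfs root. It only constructs the command and does not run it; actually mounting needs root (or a user namespace) and would normally happen inside the container's mount namespace. The function name and every path are hypothetical, and real runtimes may well do this through the mount system call rather than the mount binary.

```python
import shlex


def overlay_mount_command(layers, upperdir, workdir, merged):
    """Build the mount(8) argv for an overlayfs root.

    `layers` is ordered base-first, like image layers. overlayfs expects the
    topmost lower directory first in `lowerdir`, so the list is reversed.
    """
    if not layers:
        raise ValueError("at least one lower layer is required")
    lowerdir = ":".join(reversed(layers))
    # upperdir receives writes; workdir is scratch space on the same filesystem.
    options = f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]


if __name__ == "__main__":
    argv = overlay_mount_command(
        ["/layers/base", "/layers/app"],  # hypothetical read-only layers
        "/ctr/upper",                     # hypothetical writable layer
        "/ctr/work",
        "/ctr/rootfs",
    )
    print(shlex.join(argv))
```

The read-only lower directories play the role of image layers, while the upper directory collects whatever the container writes.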
Next up is the network namespace. When creating a new network namespace, Linux will insert a loopback interface…which is only useful for talking to yourself (try it by using --net=none with Docker)! For ease of use, container runtimes will usually set up a container’s networking for external network access by default.
A network namespace can be modified with standard Linux tools. Because of this, you can build extremely powerful features such as overlay networks (allowing containers on different hosts to communicate directly by IP as if they’re on the same network) and network policies enforced between containers.
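To illustrate what "standard Linux tools" can look like here, this Python sketch lists the ip commands that would connect a named network namespace to the host with a veth pair. It only builds the commands; running them needs root. The namespace name, interface names and addresses are all made up for illustration, and real external access would additionally need routing and NAT, which are not shown.

```python
import shlex


def veth_plan(ns="demo", host_if="veth-host", ns_if="veth-ctr",
              host_ip="10.200.0.1/24", ns_ip="10.200.0.2/24"):
    """Return ip(8) commands linking a named network namespace to the host."""
    in_ns = ["ip", "netns", "exec", ns]  # prefix: run a command inside the namespace
    return [
        ["ip", "netns", "add", ns],
        # A veth pair behaves like a virtual cable with one end per namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", ns],
        ["ip", "addr", "add", host_ip, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        in_ns + ["ip", "link", "set", "lo", "up"],
        in_ns + ["ip", "addr", "add", ns_ip, "dev", ns_if],
        in_ns + ["ip", "link", "set", ns_if, "up"],
    ]


if __name__ == "__main__":
    for argv in veth_plan():
        print(shlex.join(argv))
```

Once the printed commands have been run as root, the host and the namespace should be able to reach each other on the two example addresses.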
Linux Namespaces: Hands On
I recommend looking at bubblewrap if you want to write the code yourself, or the unshare command if you want to try out namespaces interactively. You might also be interested in using strace to see exactly how unshare or bubblewrap do their work. As a hint, the core of any container runtime is the unshare system call (different to the unshare command)!
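As a rough illustration of that hint, the following Python sketch calls the unshare system call through ctypes in a forked child and checks whether the child's UTS namespace changed. It assumes Linux with /proc mounted and unprivileged user namespaces enabled; where the kernel or a sandbox refuses, it reports False instead of failing. The function name is made up for illustration, and the UTS namespace is used only because it is easy to observe without root.

```python
import ctypes
import os
import sys

# Flag values from <linux/sched.h>.
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


def try_unshare_uts():
    """True if a forked child moved into a new UTS namespace, False if the
    attempt was refused or failed, None when not running on Linux."""
    if not sys.platform.startswith("linux"):
        return None
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: try to unshare, report one byte, then exit
        os.close(read_fd)
        ok = b"0"
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            before = os.readlink("/proc/self/ns/uts")
            # A new user namespace lets an unprivileged process create the UTS one.
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                after = os.readlink("/proc/self/ns/uts")
                ok = b"1" if after != before else b"0"
        except Exception:
            ok = b"0"
        finally:
            os.write(write_fd, ok)
            os._exit(0)
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


if __name__ == "__main__":
    print("new UTS namespace created:", try_unshare_uts())
```

The fork keeps the experiment away from the calling process, so the parent stays in its original namespaces whatever happens in the child.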
This strace output for a command running inside a PID namespace shows the unshare syscall used to ‘enter’ the mount and PID namespaces: