How Docker internally handles resource limits using Linux control groups

While using Docker it is quite natural to feel like we are dealing with a virtual machine: we can install software in a container, it has an IP address, we can SSH into it, and we can allocate memory and CPU to it just as we would for any virtual machine. In reality, though, containers are just a mechanism that allows Linux processes to run with some degree of isolation and separation; a container is basically a few different Linux primitives combined. If containers are just built on Linux primitives, how are we able to set limits on memory and CPU? The answer is simple and already available in the Linux kernel: Docker leverages Linux control groups.

What are control groups? Control groups are commonly known as cgroups. Cgroups are an abstract framework in the Linux kernel for tracking, grouping, and organizing Linux processes. Every process in a Linux system, no matter what it is, is tracked by one or more cgroups. Typically, cgroups are used to associate processes with resources: we can use them to track how much of a specific type of resource a particular group of processes is consuming. Cgroups play a big role when we are dealing with multitenancy because they let us limit or prioritize specific resources for a group of processes. This is particularly necessary because we don't want one tenant's processes to consume all of the CPU or, say, all of the I/O bandwidth. We can also pin a group of processes to a particular CPU core using cgroups.
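To make the pinning idea concrete, here is a minimal sketch using the v1 cpuset controller. It assumes root privileges and a v1 hierarchy mounted at /sys/fs/cgroup/cpuset; the group name pinned_group is invented for illustration, and the guard simply falls through on systems (such as cgroup v2 hosts) where that hierarchy is not writable:

```shell
# Sketch: restrict a group of processes to CPU core 0 with the v1 cpuset
# controller. Needs root and a v1 hierarchy at /sys/fs/cgroup/cpuset;
# the group name "pinned_group" is invented for illustration.
CPUSET=/sys/fs/cgroup/cpuset/pinned_group
if [ -d /sys/fs/cgroup/cpuset ] && [ -w /sys/fs/cgroup/cpuset ]; then
  mkdir -p "$CPUSET"
  echo 0 > "$CPUSET/cpuset.cpus"   # allow only CPU core 0
  echo 0 > "$CPUSET/cpuset.mems"   # allow only memory node 0
  # echo $$ > "$CPUSET/tasks"      # would move the current shell into the group
  pinned=yes
else
  pinned=no   # no writable v1 cpuset hierarchy (e.g. a cgroup v2 host)
fi
```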

We interact with the abstract cgroup framework through subsystems. The subsystems are the concrete implementations, each bound to a resource; examples include memory, CPU time, PIDs, devices, and network. Each subsystem is independent of the others and can organize its own processes separately, so one process can be a member of two independent cgroups. Every cgroup subsystem organizes its processes in a structured hierarchy, and controllers are responsible for distributing their specific type of resource along that hierarchy. Each subsystem has an independent hierarchy, and every task (process ID) running on the host is represented in exactly one cgroup within a given subsystem's hierarchy. These independent hierarchies allow advanced, process-specific resource segmentation: for example, two processes can share the total amount of memory they are allowed to consume while one of them gets more CPU time than the other.
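The set of subsystems the running kernel supports can be listed directly from the kernel itself:

```shell
# List the controllers (subsystems) known to the running kernel.
# Columns: subsystem name, hierarchy ID, number of cgroups, enabled flag.
cat /proc/cgroups
```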

This hierarchy is maintained as a directory and file structure. By default it is mounted at /sys/fs/cgroup, but it can be mounted at another directory of choice as well. We can even mount cgroup hierarchies in multiple locations, which comes in handy when a single instance is used by multiple tenants and we want each tenant's cgroup mounted in its own disk area. Mounting a cgroup filesystem can be done with the following command (this mounts the newer v2 unified hierarchy; the per-subsystem paths explored below come from the older v1 layout, which most distributions mount automatically):

mount -t cgroup2 none $MOUNT_POINT

Now let's explore the virtual filesystem of cgroups:

ls /sys/fs/cgroup
ls /sys/fs/cgroup/devices

Some resource controllers apply settings from the parent level down to the child level; the devices controller is one example. Others consider each level in the hierarchy independently; the memory controller can be configured this way. In each directory you'll see a file called tasks, which holds the process IDs of all processes assigned to that particular cgroup.

cat /sys/fs/cgroup/devices/tasks

It shows all of the processes that are in this particular cgroup inside this particular subsystem. But suppose you have an arbitrary process and you want to find out which cgroups it is assigned to; you can do that with the proc virtual filesystem. The proc filesystem contains a directory corresponding to each process ID.

ls /proc

To see our current shell's PID and the cgroups it belongs to:

echo $$
cat /proc/<pid>/cgroup
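As a convenient shortcut, a process can look up its own cgroup membership via /proc/self, without resolving its PID first:

```shell
# Show the cgroup membership of the current process, one line per hierarchy.
# v1 lines look like "4:memory:/some/path"; on v2 there is a single "0::/path".
cat /proc/self/cgroup
```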

Let's say we want to monitor how much memory a group is using. We can read the virtual filesystem and see what it returns.

cat /sys/fs/cgroup/memory/memory.usage_in_bytes
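Setting a limit works through the same file interface as reading usage. A minimal sketch, assuming root privileges and a writable v1 memory hierarchy; the group name demo_mem is invented for illustration:

```shell
# Sketch: cap a group at 64 MiB through the cgroup file interface. Needs root
# and a v1 memory hierarchy; the group name "demo_mem" is invented.
MEMCG=/sys/fs/cgroup/memory/demo_mem
if [ -d /sys/fs/cgroup/memory ] && [ -w /sys/fs/cgroup/memory ]; then
  mkdir -p "$MEMCG"
  echo $((64 * 1024 * 1024)) > "$MEMCG/memory.limit_in_bytes"  # 64 MiB cap
  cat "$MEMCG/memory.usage_in_bytes"                           # usage so far
  limited=yes
else
  limited=no   # no writable v1 memory hierarchy (e.g. a cgroup v2 host)
fi
```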

Why are we seeing these files and directories everywhere? They are just interfaces into the kernel's data structures for cgroups. Each directory has a distinct structure; even when you create a new directory, the kernel automatically populates it with a matching set of files. Let's create a cgroup. To create a cgroup, we just need to create a directory; at this level, that is how the kernel keeps track of it.

sudo mkdir /sys/fs/cgroup/pids/new_cgroup
ls /sys/fs/cgroup/pids/new_cgroup
cat /sys/fs/cgroup/pids/new_cgroup/tasks

The kernel keeps track of these processes using these directories and files, so adding or removing processes, or changing any settings, is nothing but changing the contents of the files.

cat /sys/fs/cgroup/pids/new_cgroup/tasks
echo $$ | sudo tee /sys/fs/cgroup/pids/new_cgroup/tasks

We have just written one line into the tasks file. Now let's see what is inside it:

cat /sys/fs/cgroup/pids/new_cgroup/tasks


We should see at least two PIDs, because when a new process starts it begins in the same cgroups as its parent. When we run the cat command, our shell starts another process, and that process shows up in the tasks file too.


We can limit the number of processes a cgroup is allowed to run by modifying the pids.max file.

echo 2 | sudo tee /sys/fs/cgroup/pids/new_cgroup/pids.max

Now let's try to run a third process: the kernel refuses the fork, and the shell reports an error along the lines of "Resource temporarily unavailable".


Now that we have a basic understanding of cgroups, let's investigate them inside our Docker containers.

Let's run a Docker container with a CPU-shares weight of 512 and explore its cgroups.

docker run --name demo --cpu-shares 512 -d --rm busybox sleep 1000

docker exec demo ls /sys/fs/cgroup/cpu

docker exec demo cat /sys/fs/cgroup/cpu/cpu.shares


So Docker is basically manipulating these cgroup settings files to get things done. Interesting indeed. If a container is not a virtual machine, these files are supposed to be visible on our host machine too, aren't they? Yes, you got it right. Usually these files are located under /sys/fs/cgroup/cpu/docker, in a directory named with the container's full (SHA-256-style, 64-character) ID.

ls /sys/fs/cgroup/cpu/docker
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares
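Rather than eyeballing the directory listing, the full container ID can be asked from Docker directly and used to build the path. This sketch assumes a running container named demo, as in the earlier example, and the v1 cgroup layout on the host:

```shell
# Resolve the full container ID with docker inspect, then read its CPU weight
# from the host side of the (v1) cgroup filesystem.
if command -v docker >/dev/null 2>&1 && docker inspect demo >/dev/null 2>&1; then
  CID=$(docker inspect --format '{{.Id}}' demo)
  cat "/sys/fs/cgroup/cpu/docker/$CID/cpu.shares"
  have_docker=yes
else
  have_docker=no   # docker absent or the demo container is not running
fi
```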

The cgroup mechanism also ties in with namespaces: if we have a particular namespace defined, its processes can be placed into a cgroup, and everything that is a member of that cgroup then becomes subject to the cgroup's controls.

One word of warning: before modifying anything in a cgroup, we should keep in mind the dependencies there might be on the existing hierarchy. For example, Amazon ECS and Google's cAdvisor rely on the existing hierarchy to know where to read CPU and memory utilization information.