Over the past few months, big changes have been underway in the Linux kernel’s cgroup subsystem and in systemd, the related but independent system and service manager. Developers aren’t building shiny new features so much as overhauling cgroups (control groups) to impose more structure on an area of the kernel that has become problematic.
Cgroup allows fine-grained resource partitioning among competing processes running on the same machine. It’s technically a kernel subsystem, but it behaves quite differently from typical, more isolated subsystems such as drivers or architecture-specific systems like PCI or USB. Cgroup is a conduit for other subsystems to manage and query kernel resources such as CPU time, amounts of memory, and groups of processes.
What’s the Issue?
The problem is that cgroup features were often built without the involvement of the developers most familiar with the kernel subsystems they interact with.
“This is partly because cgroup tends to add complexity and overhead to the existing subsystems and building and bolting something on the side is often the path of the least resistance,” said Tejun Heo, Linux kernel cgroup subsystem maintainer. “Combined with the fact that cgroup has been exploring new areas without firm established examples to follow, this led to some questionable design choices and relatively high level of inconsistency.”
The biggest issue this inconsistency created was what Heo calls “a major breach of standard kernel API practices.” Because the cgroup interface is a filesystem, it goes through much less scrutiny than other kernel APIs. The hierarchical nature of cgroup means users can change permissions on subdirectories and give access to a non-privileged security domain, i.e., non-root users, Heo said. This, in turn, means an individual application can interact directly with the cgroup filesystem and reach the kernel control knobs, effectively turning those raw knobs into a de facto kernel API that never went through the required review.
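To make the filesystem-as-interface point concrete, here is a minimal Python sketch of what "interacting directly with the cgroup filesystem" amounts to: ordinary file reads and writes, gated only by file permissions. It runs against a temporary directory standing in for a delegated cgroup directory, so nothing on the system is changed. The helper names are hypothetical; the file name `memory.limit_in_bytes` is the limit file of the original (v1) memory controller and is used here only as an example.

```python
import os
import tempfile


def writable_knobs(cgroup_dir):
    """List the control files in cgroup_dir that the current user may write."""
    return sorted(
        name
        for name in os.listdir(cgroup_dir)
        if os.path.isfile(os.path.join(cgroup_dir, name))
        and os.access(os.path.join(cgroup_dir, name), os.W_OK)
    )


def set_knob(cgroup_dir, knob, value):
    """Write a value to a control file, as any process with access could."""
    with open(os.path.join(cgroup_dir, knob), "w") as handle:
        handle.write(str(value))


def read_knob(cgroup_dir, knob):
    """Read back the current contents of a control file."""
    with open(os.path.join(cgroup_dir, knob)) as handle:
        return handle.read().strip()


# Stand-in for a delegated cgroup directory. A real one would sit under a
# mounted cgroup hierarchy and its files would be created by the kernel.
demo_dir = tempfile.mkdtemp(prefix="fake-cgroup-")
open(os.path.join(demo_dir, "memory.limit_in_bytes"), "w").close()

# No system call specific to cgroups is involved: just a file write.
set_knob(demo_dir, "memory.limit_in_bytes", 256 * 1024 * 1024)
print(writable_knobs(demo_dir))
print(read_knob(demo_dir, "memory.limit_in_bytes"))
```

Once a directory's permissions have been handed to a non-root user, nothing in this path distinguishes a reviewed, intentional use of a knob from an accidental one, which is the concern Heo describes.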
Other issues include “the inability to designate a resource to a cgroup due to the orthogonal multiple hierarchies, widespread inconsistencies in hierarchy handling, unnecessarily high level of complexity,” and more, Heo said.
How to Fix It
Cgroup is made up of two parts: the cgroup core creates a hierarchical classification of processes running on the system, while a set of 13 controllers links the core with the kernel subsystems. The memory controller, for example, limits the amount of memory a group of processes can allocate from the system, the block controller can limit disk I/O bandwidth, and so on.
Kernel developers are now working to fix these issues by implementing a single unified hierarchy in the cgroup core and improving consistency among the controllers. But because of the patchwork nature of the subsystem and the need to ensure backward compatibility, they won’t be able to completely stop this kind of direct access to the kernel knobs. That’s where systemd, and any other control agents that may emerge, come in.
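The split between core and controllers is visible from user space. The kernel lists its controllers in `/proc/cgroups`, one per line, with the hierarchy each is attached to. The following Python sketch parses that format; the sample text is made up for illustration (real output varies by kernel and configuration), and the function names are hypothetical.

```python
# Illustrative sample in the /proc/cgroups column layout; the numbers
# are invented, not taken from a real system.
SAMPLE = """\
#subsys_name hierarchy num_cgroups enabled
cpu 2 4 1
memory 3 12 1
blkio 4 4 0
"""


def parse_proc_cgroups(text):
    """Turn /proc/cgroups-style text into a list of controller records."""
    controllers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header row
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers.append(
            {
                "name": name,
                "hierarchy": int(hierarchy),
                "num_cgroups": int(num_cgroups),
                "enabled": enabled == "1",
            }
        )
    return controllers


def load_controllers(path="/proc/cgroups"):
    """Read the controller list from a live Linux system."""
    with open(path) as handle:
        return parse_proc_cgroups(handle.read())


for controller in parse_proc_cgroups(SAMPLE):
    state = "enabled" if controller["enabled"] else "disabled"
    print(controller["name"], "on hierarchy", controller["hierarchy"], state)
```

In the sample, each controller sits on a different hierarchy ID, which is the multiple-hierarchy arrangement the unification work is meant to replace.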
Systemd is the common tool for Linux system administrators to control resources. It relies on cgroups to track the state of services, logged-in users, and virtual machines, and it exposes the kernel resource control knobs to the administrator, said systemd developer Kay Sievers.
Systemd and cgroup developers are working together to turn systemd into a global cgroup manager that creates higher-level control knobs and prevents direct access to the kernel. Many systemd changes have already been released, while the cgroup changes are set to be merged into the upstream kernel. Much work still remains, however.
The conversion of the separate controller hierarchies into a single, unified hierarchy will be a “gigantic job” for the kernel and user space alike, Sievers said.
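As a rough illustration of what a higher-level knob looks like from the administrator's side, the Python sketch below builds a `systemctl set-property` command line for a unit instead of writing to cgroup files. It only constructs the argument list and does not run it. The unit name `example.service` and the helper function are hypothetical, and the property names shown (`CPUShares`, `MemoryLimit`) are assumed to match the systemd release in use; newer releases use different names, so check the man pages for your version.

```python
def set_property_command(unit, properties, runtime=False):
    """Build the argv for `systemctl set-property` without executing it."""
    if not properties:
        raise ValueError("at least one PROPERTY=VALUE pair is required")
    command = ["systemctl"]
    if runtime:
        # --runtime applies the change until the next reboot only.
        command.append("--runtime")
    command += ["set-property", unit]
    command += [f"{name}={value}" for name, value in properties.items()]
    return command


# Hypothetical unit and values, for illustration only.
argv = set_property_command(
    "example.service",
    {"CPUShares": 512, "MemoryLimit": "1G"},
    runtime=True,
)
print(" ".join(argv))
```

The request names a service rather than a directory in the cgroup filesystem, and systemd decides how to translate it into kernel settings.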
“When complete, the above efforts will give us far more structured way to think about and interact with cgroups,” Heo said, “which in the long run will make cgroup more useful to wider audience and enable capabilities which are currently not possible.”
For more details about the changes, see the cgroup documentation on kernel.org and the systemd man pages.