...
## systemd cgroup driver

By default, runc creates cgroups and sets cgroup limits on its own (this mode
is known as the fs cgroup driver). When the `--systemd-cgroup` global option is
given (as in e.g. `runc --systemd-cgroup run ...`), runc switches to the systemd
cgroup driver. This document describes its features and peculiarities.

### systemd unit name and placement

When creating a container, runc asks systemd (over D-Bus) to create
a transient unit for the container and to place it into a specified slice.

The name of the unit and the containing slice are derived from the container
runtime spec in the following way:

1. If `Linux.CgroupsPath` is set, it is expected to be in the form
   `[slice]:[prefix]:[name]`.

   Here `slice` is a systemd slice under which the container is placed.
   If empty, it defaults to `system.slice`, except when cgroup v2 is
   used and a rootless container is created, in which case it defaults
   to `user.slice`.

   Note that `slice` can contain dashes to denote a sub-slice
   (e.g. `user-1000.slice` is a correct notation, meaning a sub-slice
   of `user.slice`), but it must not contain slashes (e.g.
   `user.slice/user-1000.slice` is invalid).

   A `slice` of `-` represents a root slice.

   Next, `prefix` and `name` are used to compose the unit name, which
   is `<prefix>-<name>.scope`, unless `name` has a `.slice` suffix, in
   which case `prefix` is ignored and `name` is used as is.

2. If `Linux.CgroupsPath` is not set or is empty, it works the same way as if
   it were set to `:runc:<container-id>`. See the description above to see
   what it transforms to.
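
The rules above can be sketched in a few lines of Python. This is only an illustration of the text, not runc's actual implementation: the function name `unit_from_cgroups_path` and the `rootless_v2` flag are hypothetical, and cases the text does not spell out (such as an empty `prefix`) are handled by following the stated formula literally.

```python
def unit_from_cgroups_path(cgroups_path, container_id, rootless_v2=False):
    """Return (parent_slice, unit_name) per the rules described above.

    Hypothetical helper for illustration; not part of runc.
    """
    if not cgroups_path:
        # Rule 2: an unset or empty path behaves like ":runc:<container-id>".
        cgroups_path = f":runc:{container_id}"

    parts = cgroups_path.split(":")
    if len(parts) != 3:
        raise ValueError("expected the form [slice]:[prefix]:[name]")
    parent, prefix, name = parts

    if "/" in parent:
        # Sub-slices are written with dashes, never with slashes.
        raise ValueError("slice must not contain slashes")
    if not parent:
        # Default slice; user.slice only for rootless containers on cgroup v2.
        parent = "user.slice" if rootless_v2 else "system.slice"
    # A parent of "-" (the root slice) is passed through unchanged here.

    if name.endswith(".slice"):
        unit = name  # prefix is ignored, name is used as is
    else:
        unit = f"{prefix}-{name}.scope"
    return parent, unit
```

For instance, `unit_from_cgroups_path("", "abc")` yields `("system.slice", "runc-abc.scope")`, matching the `:runc:<container-id>` default.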

As described above, a unit being created can either be a scope or a slice.
For a scope, runc specifies its parent slice via a _Slice=_ systemd property,
and also sets _Delegate=true_. For a slice, runc specifies a weak dependency on
the parent slice via a _Wants=_ property.

### Resource limits

runc always enables accounting for all controllers, regardless of any limits
being set. This means it unconditionally sets the following properties for the
systemd unit being created:

 * _CPUAccounting=true_
 * _IOAccounting=true_ (_BlockIOAccounting_ for cgroup v1)
 * _MemoryAccounting=true_
 * _TasksAccounting=true_
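
As a rough sketch, the placement and accounting properties described so far could be collected like this. The function name `base_unit_properties` and the list-of-tuples representation are assumptions made for illustration; runc itself passes properties to systemd over D-Bus and may set others as well.

```python
def base_unit_properties(unit_name, parent_slice, cgroup_v2=True):
    """List the (name, value) properties the text says are always set.

    Hypothetical helper for illustration; not part of runc.
    """
    props = []
    if unit_name.endswith(".slice"):
        # A slice only gets a weak dependency on its parent slice.
        props.append(("Wants", parent_slice))
    else:
        # A scope is placed into the parent slice and gets delegation.
        props.append(("Slice", parent_slice))
        props.append(("Delegate", True))

    # Accounting is enabled unconditionally; the I/O property name
    # depends on the cgroup version.
    io_accounting = "IOAccounting" if cgroup_v2 else "BlockIOAccounting"
    for name in ("CPUAccounting", io_accounting,
                 "MemoryAccounting", "TasksAccounting"):
        props.append((name, True))
    return props
```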

The resource limits of the systemd unit are set by runc by translating the
runtime spec resources to systemd unit properties.

Such translation is by no means complete, as there are some cgroup properties
that cannot be set via systemd. Therefore, the runc systemd cgroup driver is
backed by the fs driver (in other words, cgroup limits are first set via systemd
unit properties, and then by writing to cgroupfs files).

The set of runtime spec resources that runc translates to systemd unit
properties depends on the kernel cgroup version being used (v1 or v2), and on
the systemd version being run. If an older systemd version (which does not
support some resources) is used, runc does not set those resources.

The following tables summarize which properties are translated.

#### cgroup v1

| runtime spec resource | systemd property name | min systemd version |
|-----------------------|-----------------------|---------------------|
| memory.limit          | MemoryLimit           |                     |
| cpu.shares            | CPUShares             |                     |
| blockIO.weight        | BlockIOWeight         |                     |
| pids.limit            | TasksMax              |                     |
| cpu.cpus              | AllowedCPUs           | v244                |
| cpu.mems              | AllowedMemoryNodes    | v244                |

#### cgroup v2

| runtime spec resource   | systemd property name       | min systemd version |
|-------------------------|-----------------------------|---------------------|
| memory.limit            | MemoryMax                   |                     |
| memory.reservation      | MemoryLow                   |                     |
| memory.swap             | MemorySwapMax               |                     |
| cpu.shares              | CPUWeight                   |                     |
| pids.limit              | TasksMax                    |                     |
| cpu.cpus                | AllowedCPUs                 | v244                |
| cpu.mems                | AllowedMemoryNodes          | v244                |
| unified.cpu.max         | CPUQuota, CPUQuotaPeriodSec | v242                |
| unified.cpu.weight      | CPUWeight                   |                     |
| unified.cpuset.cpus     | AllowedCPUs                 | v244                |
| unified.cpuset.mems     | AllowedMemoryNodes          | v244                |
| unified.memory.high     | MemoryHigh                  |                     |
| unified.memory.low      | MemoryLow                   |                     |
| unified.memory.min      | MemoryMin                   |                     |
| unified.memory.max      | MemoryMax                   |                     |
| unified.memory.swap.max | MemorySwapMax               |                     |
| unified.pids.max        | TasksMax                    |                     |

For documentation on systemd unit resource properties, see the
`systemd.resource-control(5)` man page.
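
To illustrate how the "min systemd version" column could be applied, here is a small lookup sketch over a few rows of the cgroup v2 table. The names `CGROUP_V2_TABLE` and `systemd_properties_for` are hypothetical, the value `0` is merely this sketch's way of encoding "no minimum listed", and only a subset of rows is included.

```python
# A few rows of the cgroup v2 table: resource -> (properties, min version).
# 0 stands for "no minimum version listed" (an assumption of this sketch).
CGROUP_V2_TABLE = {
    "memory.limit": (("MemoryMax",), 0),
    "memory.reservation": (("MemoryLow",), 0),
    "memory.swap": (("MemorySwapMax",), 0),
    "cpu.shares": (("CPUWeight",), 0),
    "pids.limit": (("TasksMax",), 0),
    "cpu.cpus": (("AllowedCPUs",), 244),
    "cpu.mems": (("AllowedMemoryNodes",), 244),
    "unified.cpu.max": (("CPUQuota", "CPUQuotaPeriodSec"), 242),
}


def systemd_properties_for(resource, systemd_version, table=CGROUP_V2_TABLE):
    """Return the systemd property names used for a runtime spec resource.

    An empty tuple means the resource is not set via systemd properties,
    either because it is not in the table or because the running systemd
    is older than the listed minimum version.
    """
    entry = table.get(resource)
    if entry is None:
        return ()
    names, min_version = entry
    return names if systemd_version >= min_version else ()
```

With systemd v243, for example, `systemd_properties_for("cpu.cpus", 243)` returns an empty tuple, while `unified.cpu.max` still maps to `CPUQuota` and `CPUQuotaPeriodSec`.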

### Auxiliary properties

Auxiliary properties of a systemd unit (as shown by `systemctl show
<unit-name>` after the container is created) can be set (or overwritten) by
adding annotations to the container runtime spec (`config.json`).

For example:

```json
    "annotations": {
        "org.systemd.property.TimeoutStopUSec": "uint64 123456789",
        "org.systemd.property.CollectMode": "'inactive-or-failed'"
    },
```

The above will set the following properties:

* `TimeoutStopUSec` to 123456789 microseconds, that is, about 2 minutes and
  3 seconds;
* `CollectMode` to "inactive-or-failed".
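
The timeout arithmetic can be checked with a short sketch. The helper name `split_usec` is hypothetical and only illustrates the unit conversion for a value given in microseconds.

```python
def split_usec(usec):
    """Split a microsecond count into (minutes, seconds, microseconds)."""
    seconds, micros = divmod(usec, 1_000_000)
    minutes, seconds = divmod(seconds, 60)
    return minutes, seconds, micros


# 123456789 microseconds is 2 minutes, 3 seconds and 456789 microseconds.
print(split_usec(123456789))
```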

The values must be in the GVariant text format, as described in the
[GVariant documentation](https://docs.gtk.org/glib/gvariant-text.html).

To find out which type systemd expects for a particular property, please
consult the systemd sources.
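
As a minimal sketch, annotations like the ones in the example above could be generated as follows. The helpers `gvariant_text` and `systemd_annotations` are hypothetical and cover only booleans, integers with an optional type keyword, and simple strings; the caller still has to supply the type that systemd expects.

```python
def gvariant_text(value, type_hint=None):
    """Render a Python value in GVariant text format (small subset only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        # A type keyword such as "uint64" may precede the number.
        return f"{type_hint} {value}" if type_hint else str(value)
    if isinstance(value, str):
        # Strings are quoted; escape backslashes and single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def systemd_annotations(props):
    """Map {property: (value, type_hint)} to runtime spec annotations."""
    return {
        f"org.systemd.property.{name}": gvariant_text(value, hint)
        for name, (value, hint) in props.items()
    }


annotations = systemd_annotations({
    "TimeoutStopUSec": (123456789, "uint64"),
    "CollectMode": ("inactive-or-failed", None),
})
```

The resulting dictionary has the same keys and values as the `annotations` object in the JSON example and can be merged into `config.json`.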