# NFTables kube-proxy

This is an implementation of service proxying via the nftables API of
the kernel netfilter subsystem.

## General theory of netfilter

Packet flow through netfilter looks something like:

```text
             +================+      +=====================+
             | hostNetwork IP |      | hostNetwork process |
             +================+      +=====================+
                         ^                |
  -  -  -  -  -  -  -  - | -  -  -  -  - [*] -  -  -  -  -  -  -  -  -
                         |                v
                     +-------+        +--------+
                     | input |        | output |
                     +-------+        +--------+
                         ^                |
      +------------+     |   +---------+  v      +-------------+
      | prerouting |-[*]-+-->| forward |--+-[*]->| postrouting |
      +------------+         +---------+         +-------------+
            ^                                           |
 -  -  -  - | -  -  -  -  -  -  -  -  -  -  -  -  -  -  |  -  -  -  -
            |                                           v
       +---------+                                  +--------+
   --->| ingress |                                  | egress |--->
       +---------+                                  +--------+
```

where the `[*]` represents a routing decision, and all of the boxes except those in the top
row represent netfilter hooks. More detailed versions of this diagram can be seen at
https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg and
https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks but note that in the
standard version of this diagram, the top two boxes are squished together into "local
process" which (a) fails to make a few important distinctions, and (b) makes it look like
a single packet can go `input` -> "local process" -> `output`, which it cannot. Note also
that the `ingress` and `egress` hooks are special and mostly not available to us;
kube-proxy lives in the middle section of the diagram, with the five main netfilter hooks.

There are three paths through the diagram, called the "input", "forward", and "output"
paths, depending on which of those hooks a packet passes through. Packets coming from host
network namespace processes always take the output path, while packets coming in from
outside the host network namespace (whether that's from an external host or from a pod
network namespace) arrive via `ingress` and take the input or forward path, depending on
the routing decision made after `prerouting`: packets destined for an IP which is assigned
to a network interface in the host network namespace get routed along the input path;
anything else (including, in particular, packets destined for a pod IP) gets routed along
the forward path.
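
The three paths can be summarized as a small decision function. The following is a minimal Go sketch, not code from kube-proxy: `classify`, `Path`, and the two boolean inputs are hypothetical names introduced for illustration, and the hook lists only cover the five main hooks from the diagram (`ingress` and `egress` are left out).

```go
package main

import "fmt"

// Path is one of the three routes a packet can take through the five main hooks.
type Path string

const (
	InputPath   Path = "input"
	ForwardPath Path = "forward"
	OutputPath  Path = "output"
)

// classify is a hypothetical helper (not part of kube-proxy) that mirrors the
// routing decisions in the diagram.
//
// fromHostProcess: the packet was generated by a host network namespace process.
// dstIsHostIP: the destination IP (as seen by the routing decision after
// prerouting) is assigned to an interface in the host network namespace.
func classify(fromHostProcess, dstIsHostIP bool) (Path, []string) {
	switch {
	case fromHostProcess:
		// Locally generated traffic always starts on the output path,
		// regardless of where it is going.
		return OutputPath, []string{"output", "postrouting"}
	case dstIsHostIP:
		return InputPath, []string{"prerouting", "input"}
	default:
		return ForwardPath, []string{"prerouting", "forward", "postrouting"}
	}
}

func main() {
	cases := []struct {
		desc            string
		fromHostProcess bool
		dstIsHostIP     bool
	}{
		{"pod -> pod IP", false, false},
		{"external host -> node IP", false, true},
		{"host-network process -> pod IP", true, false},
	}
	for _, c := range cases {
		path, hooks := classify(c.fromHostProcess, c.dstIsHostIP)
		fmt.Printf("%-32s %-8s %v\n", c.desc, path, hooks)
	}
}
```

Note that the sketch treats each packet as taking exactly one path; it does not model a locally generated packet that is delivered back to the host network namespace.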

## kube-proxy's use of nftables hooks

Kube-proxy uses nftables for seven things:

  - Using DNAT to rewrite traffic from service IPs (cluster IPs, external IPs, load balancer
    IPs, and NodePorts on node IPs) to the corresponding endpoint IPs.

  - Using SNAT to masquerade traffic as needed to ensure that replies to it will come back
    to this node/namespace (so that they can be un-DNAT-ed).

  - Dropping packets that are filtered out by the `LoadBalancerSourceRanges` feature.

  - Dropping packets for services with `Local` traffic policy but no local endpoints.

  - Rejecting packets for services with no local or remote endpoints.

  - Dropping packets to ClusterIPs which are not yet allocated.

  - Rejecting packets to undefined ports of ClusterIPs.
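
The filtering items in this list can be read as a decision over a single packet. The Go sketch below is an illustration under stated assumptions, not kube-proxy's implementation: `ServiceTraffic`, `decide`, and the field names are hypothetical, the order of the checks is a simplification chosen for readability, and SNAT masquerading is not modeled.

```go
package main

import "fmt"

// Verdict is the outcome for a packet addressed to a service IP.
type Verdict string

const (
	VerdictDNAT   Verdict = "dnat"
	VerdictDrop   Verdict = "drop"
	VerdictReject Verdict = "reject"
)

// ServiceTraffic is a hypothetical, flattened view of what is known about
// a packet and the service it targets. None of these names come from kube-proxy.
type ServiceTraffic struct {
	ClusterIPAllocated bool // destination ClusterIP belongs to an existing service
	PortDefined        bool // destination port is defined on that service
	SourceAllowed      bool // source passes LoadBalancerSourceRanges (true if unset)
	LocalPolicy        bool // service uses the Local traffic policy
	LocalEndpoints     int  // ready endpoints on this node
	RemoteEndpoints    int  // ready endpoints on other nodes
}

// decide applies the filtering cases from the list; anything that survives
// them is DNATed to an endpoint. The check order is an assumption.
func decide(t ServiceTraffic) Verdict {
	switch {
	case !t.ClusterIPAllocated:
		return VerdictDrop // ClusterIP not yet allocated
	case !t.PortDefined:
		return VerdictReject // undefined port of a ClusterIP
	case !t.SourceAllowed:
		return VerdictDrop // filtered by LoadBalancerSourceRanges
	case t.LocalEndpoints+t.RemoteEndpoints == 0:
		return VerdictReject // no local or remote endpoints
	case t.LocalPolicy && t.LocalEndpoints == 0:
		return VerdictDrop // Local traffic policy, no local endpoints
	default:
		return VerdictDNAT
	}
}

func main() {
	healthy := ServiceTraffic{
		ClusterIPAllocated: true, PortDefined: true, SourceAllowed: true,
		LocalEndpoints: 1, RemoteEndpoints: 2,
	}
	localOnly := healthy
	localOnly.LocalPolicy = true
	localOnly.LocalEndpoints = 0

	fmt.Println("healthy service:", decide(healthy))
	fmt.Println("Local policy, no local endpoints:", decide(localOnly))
}
```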

This is implemented as follows:

  - We do the DNAT for inbound traffic in `prerouting`: this covers traffic coming from
    off-node to all types of service IPs, and traffic coming from pods to all types of
    service IPs. (We *must* do this in `prerouting`, because the choice of endpoint IP may
    affect whether the packet then gets routed along the input path or the forward path.)

  - We do the DNAT for outbound traffic in `output`: this covers traffic coming from
    host-network processes to all types of service IPs. Regardless of the final
    destination, the traffic will take the "output path". (In the case where a
    host-network process connects to a service IP that DNATs it to a host-network endpoint
    IP, the traffic will still initially take the "output path", but then reappear on the
    "input path".)

  - `LoadBalancerSourceRanges` firewalling has to happen before service DNAT, so we do
    that on `prerouting` and `output` as well, with a lower (i.e. more urgent) priority
    than the DNAT chains.

  - The `drop` and `reject` rules for services with no endpoints don't need to happen
    explicitly before or after any other rules (since they match packets that wouldn't be
    matched by any other rules). But with kernels before 5.9, `reject` is not allowed in
    `prerouting`, so we can't just do them in the same place as the source ranges
    firewall. So we do these checks from `input`, `forward`, and `output` for
    `@no-endpoint-services` and from `input` for `@no-endpoint-nodeports` to cover all
    the possible paths.

  - Masquerading has to happen in the `postrouting` hook, because "masquerade" means "SNAT
    to the IP of the interface the packet is going out on", so it has to happen after the
    final routing decision. (We don't need to masquerade packets that are going to a host
    network IP, because masquerading is about ensuring that the packet eventually gets
    routed back to the host network namespace on this node, so if it's never getting
    routed away from there, there's nothing to do.)

  - We install a `reject` rule for ClusterIPs matching the `@cluster-ips` set and a `drop`
    rule for ClusterIPs belonging to any of the ServiceCIDRs, in the `forward` and `output`
    hooks, with a higher (i.e. less urgent) priority than the DNAT chains, so that all valid
    traffic directed to ClusterIPs has already been DNATed by the time these rules run. The
    `drop` rule is only installed if the `MultiCIDRServiceAllocator` feature is enabled.
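
The ordering constraints above come down to which hook each base chain attaches to and what priority it has within that hook, with lower priority values running first. The Go sketch below illustrates that ordering only. The chain names are hypothetical and are not kube-proxy's actual chain names; the values -100, 0, and 100 are the standard nftables priorities for `dstnat`, `filter`, and `srcnat`, while the `-10` offset used for the firewall chains is an assumption made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Standard nftables priority values; lower values run earlier within a hook.
const (
	prioDNAT   = -100 // "dstnat"
	prioFilter = 0    // "filter"
	prioSNAT   = 100  // "srcnat"
)

// baseChain describes where a chain is attached. All names are hypothetical.
type baseChain struct {
	name     string
	hook     string
	priority int
}

// exampleChains lists illustrative chains in arbitrary order.
func exampleChains() []baseChain {
	return []baseChain{
		{"masquerade-postrouting", "postrouting", prioSNAT},
		{"checks-output", "output", prioFilter}, // no-endpoint and ClusterIP checks
		{"dnat-output", "output", prioDNAT},
		{"checks-forward", "forward", prioFilter},
		{"dnat-prerouting", "prerouting", prioDNAT},
		{"checks-input", "input", prioFilter},
		// Source-range firewalling must see pre-DNAT destinations, so it gets
		// a lower (more urgent) priority than DNAT. The offset is an assumption.
		{"firewall-output", "output", prioDNAT - 10},
		{"firewall-prerouting", "prerouting", prioDNAT - 10},
	}
}

// chainsForHook returns the names of the chains on one hook in execution order.
func chainsForHook(chains []baseChain, hook string) []string {
	var matched []baseChain
	for _, c := range chains {
		if c.hook == hook {
			matched = append(matched, c)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].priority < matched[j].priority
	})
	names := make([]string, 0, len(matched))
	for _, c := range matched {
		names = append(names, c.name)
	}
	return names
}

func main() {
	for _, hook := range []string{"prerouting", "input", "forward", "output", "postrouting"} {
		fmt.Printf("%-11s %v\n", hook, chainsForHook(exampleChains(), hook))
	}
}
```

On the `output` hook this places the firewall chain first, the DNAT chain second, and the post-DNAT checks last, which matches the relative ordering described in the list.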