# Secure Alertmanager cluster traffic

Type: Design document

Date: 2019-02-21

Author: Max Inden <IndenML@gmail.com>


## Status Quo

Alertmanager supports [high
availability](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability)
by interconnecting multiple Alertmanager instances to form an Alertmanager
cluster. Instances of a cluster communicate on top of a gossip protocol managed
via HashiCorp's [_Memberlist_](https://github.com/hashicorp/memberlist) library.
_Memberlist_ uses two channels to communicate: TCP for reliable and UDP for
best-effort communication.

Alertmanager instances use the gossip layer to:

- Keep track of membership
- Replicate silence creation, update and deletion
- Replicate the notification log

As of today, communication between Alertmanager instances in a cluster is
sent in clear text.


## Goal

Instances in a cluster should communicate with each other securely.
Alertmanager should guarantee confidentiality, integrity and client authenticity
for each message touching the wire. While this would improve the security of
single-datacenter deployments, one could see it as a necessity for
wide-area-network deployments.


## Non-Goal

Even though solutions might also be applicable to the API endpoints exposed by
Alertmanager, it is not the goal of this design document to secure the API
endpoints.


## Proposed Solution - TLS Memberlist

_Memberlist_ enables users to implement their own [transport
layer](https://godoc.org/github.com/hashicorp/memberlist#Transport) without the
need to fork the library itself. That transport layer needs to support
reliable as well as best-effort communication. Instead of using TCP and UDP like
the default transport layer of _Memberlist_, the suggestion is to use only TCP
for both reliable and best-effort communication. On top of that TCP
layer, one can use mutual TLS to secure all communication. A proof-of-concept
implementation can be found here:
https://github.com/mxinden/memberlist-tls-transport.

The data gossiped between instances has no low-latency requirement that TCP
could not fulfill. The same applies to Alertmanager's relatively low data
throughput requirements.

TCP connections could be kept alive beyond a single message to reduce latency as
well as handshake overhead. While this is feasible in a 3-instance
Alertmanager cluster, the discussed custom implementation would need to limit
the number of open connections for clusters with many instances, since a full
mesh of n instances needs n*(n-1)/2 connections.

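To get a feeling for how quickly a full mesh grows, the formula can be written
as a small function. The name `fullMeshConnections` is made up for this
example. For 3 instances it yields 3 connections, for 10 instances 45, and for
50 instances 1225.

```go
package main

// fullMeshConnections returns the number of TCP connections needed so that
// each of n instances shares one kept-alive connection with every other
// instance: n*(n-1)/2.
func fullMeshConnections(n int) int {
	if n < 2 {
		return 0 // a single instance has no peers to connect to
	}
	return n * (n - 1) / 2
}
```
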
As of today, Alertmanager already forces _Memberlist_ to use the reliable TCP
connection instead of the best-effort UDP connection to gossip large
notification logs and silences between instances. The reason is that those
packets would otherwise exceed the
[MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) of most UDP
setups. Splitting packets is not supported by _Memberlist_ and was not
considered worth the effort to implement in Alertmanager either. For more
info see this [GitHub
issue](https://github.com/prometheus/alertmanager/issues/1412).

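The routing decision described above comes down to a size check. The following
is only a sketch: the function name and the 1400-byte limit are assumptions
chosen for illustration, not values stated in this document.

```go
package main

// maxPacketSize is an assumed upper bound in bytes for one best-effort
// packet. It is illustrative only and not a value taken from this document.
const maxPacketSize = 1400

// needsReliableChannel reports whether msg is too large for a single
// best-effort packet and should therefore be sent over the reliable channel.
func needsReliableChannel(msg []byte) bool {
	return len(msg) > maxPacketSize
}
```
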
With the last [Prometheus developer
summit](https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit)
in mind, the Prometheus project's preferred security mechanism seems to be
mutual TLS. Having Alertmanager use the same mechanism would ease deployment
with the rest of the Prometheus stack.

As a beneficial side effect, Alertmanager would only need a single open port
(TCP traffic) instead of two open ports (TCP and UDP traffic) for cluster
communication. This does not affect the API endpoint, which remains a separate
TCP port.


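With both kinds of traffic arriving on one TCP port, the receiving side has to
tell best-effort messages from reliable ones and find message boundaries in the
byte stream. One possible approach, sketched here purely as an assumption and
not as the wire format of the proof of concept, is to prefix every message with
a kind byte and a length. All names and constants below are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"io"
)

// Hypothetical message kinds; not taken from Memberlist or the proof of concept.
const (
	kindPacket byte = 1 // best-effort message, formerly sent over UDP
	kindStream byte = 2 // reliable message, formerly sent over plain TCP
)

// maxFrameSize caps the payload length a reader accepts (assumed value).
const maxFrameSize = 1 << 20

// writeFrame writes one kind byte, a 4-byte big-endian length, then the payload.
func writeFrame(w io.Writer, kind byte, payload []byte) error {
	if len(payload) > maxFrameSize {
		return errors.New("payload exceeds maxFrameSize")
	}
	header := make([]byte, 5)
	header[0] = kind
	binary.BigEndian.PutUint32(header[1:], uint32(len(payload)))
	if _, err := w.Write(header); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one frame produced by writeFrame and returns its kind and payload.
func readFrame(r io.Reader) (byte, []byte, error) {
	header := make([]byte, 5)
	if _, err := io.ReadFull(r, header); err != nil {
		return 0, nil, err
	}
	size := binary.BigEndian.Uint32(header[1:])
	if size > maxFrameSize {
		return 0, nil, errors.New("frame exceeds maxFrameSize")
	}
	payload := make([]byte, size)
	if _, err := io.ReadFull(r, payload); err != nil {
		return 0, nil, err
	}
	return header[0], payload, nil
}

// roundTrip encodes and decodes one frame through an in-memory buffer,
// standing in for a TLS-wrapped TCP connection.
func roundTrip(kind byte, payload []byte) (byte, []byte, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, kind, payload); err != nil {
		return 0, nil, err
	}
	return readFrame(&buf)
}
```
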
## Alternative Solutions

### Symmetric Memberlist

_Memberlist_ supports [symmetric key
encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
updates. Securing the cluster traffic via symmetric encryption would involve
only small configuration changes in the Alertmanager code base.


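The three cipher variants correspond to key lengths of 16, 24 and 32 bytes. A
configuration change of this kind would likely want to reject keys of any other
length early. The sketch below does so with the standard library only; the
helper name `validateGossipKey` is an assumption made for this example, and
wiring the key into _Memberlist_ is left out.

```go
package main

import (
	"crypto/aes"
	"fmt"
)

// validateGossipKey checks that key has a length usable for AES-128, AES-192
// or AES-256 (16, 24 or 32 bytes). Hypothetical helper for illustration.
func validateGossipKey(key []byte) error {
	// aes.NewCipher returns an error for any other key length.
	if _, err := aes.NewCipher(key); err != nil {
		return fmt.Errorf("invalid gossip key: %w", err)
	}
	return nil
}
```
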
### Replace Memberlist

Coordinating membership might not be required by the Alertmanager cluster
component. Instead this could be bound to static configuration or e.g. DNS
service discovery. On the other hand, gossiping silences and notifications is
ideally done in an eventually consistent gossip fashion, given that Alertmanager
is supposed to scale beyond a 3-instance cluster and beyond local-area-network
deployments. With these requirements in mind, replacing _Memberlist_ with an
entirely self-built communication layer would be a major undertaking.


### TLS Memberlist with DTLS

Instead of redirecting all best-effort traffic via the reliable channel as
proposed above, one could also secure the best-effort channel itself using UDP
and [DTLS](https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
addition to securing the reliable traffic via TCP and TLS. However, DTLS is not
supported by the Go standard library.
