# Secure Alertmanager cluster traffic

Type: Design document

Date: 2019-02-21

Author: Max Inden <IndenML@gmail.com>


## Status Quo

Alertmanager supports [high
availability](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability)
by interconnecting multiple Alertmanager instances to form an Alertmanager
cluster. Instances of a cluster communicate over a gossip protocol managed by
HashiCorp's [_Memberlist_](https://github.com/hashicorp/memberlist) library.
_Memberlist_ uses two channels to communicate: TCP for reliable and UDP for
best-effort communication.

Alertmanager instances use the gossip layer to:

- Keep track of membership
- Replicate silence creation, update and deletion
- Replicate the notification log

As of today, all communication between Alertmanager instances in a cluster is
sent in clear text.


## Goal

Instances in a cluster should communicate with each other in a secure fashion.
Alertmanager should guarantee confidentiality, integrity and client authenticity
for each message touching the wire. While this would improve the security of
single-datacenter deployments, it is arguably a necessity for
wide-area-network deployments.


## Non-Goal

Even though the solutions discussed here might also be applicable to the API
endpoints exposed by Alertmanager, securing those endpoints is not a goal of
this design document.


## Proposed Solution - TLS Memberlist

_Memberlist_ enables users to implement their own [transport
layer](https://godoc.org/github.com/hashicorp/memberlist#Transport) without
having to fork the library itself. That transport layer needs to support
reliable as well as best-effort communication. Instead of using TCP and UDP like
the default transport layer of _Memberlist_, the suggestion is to use only TCP
for both reliable and best-effort communication. On top of that TCP layer, one
can use mutual TLS to secure all communication. A proof-of-concept
implementation can be found here:
https://github.com/mxinden/memberlist-tls-transport.

The data gossiped between instances does not have a low-latency requirement that
TCP could not fulfill. The same applies to the relatively low data throughput
requirements of Alertmanager.

TCP connections could be kept alive beyond a single message to reduce latency as
well as handshake overhead. While this is feasible in a 3-instance Alertmanager
cluster, the discussed custom implementation would need to limit the number of
open connections for clusters with many instances, since a fully connected
cluster of n instances needs n*(n-1)/2 connections (3 for n = 3, but 45 for
n = 10).

As of today, Alertmanager already forces _Memberlist_ to use the reliable TCP
connection instead of the best-effort UDP connection to gossip large
notification logs and silences between instances. The reason is that those
packets would otherwise exceed the
[MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) of most UDP
setups. Splitting packets is not supported by _Memberlist_ and was not
considered worth the effort of implementing in Alertmanager either. For more
information, see this [GitHub
issue](https://github.com/prometheus/alertmanager/issues/1412).

With the last [Prometheus developer
summit](https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit)
in mind, the Prometheus project's preferred security mechanism seems to be
mutual TLS. Having Alertmanager use the same mechanism would ease deployment
with the rest of the Prometheus stack.

As a beneficial side effect, Alertmanager would only need a single open port
(TCP traffic) instead of two open ports (TCP and UDP traffic) for cluster
communication. This does not affect the API endpoint, which remains on a
separate TCP port.


## Alternative Solutions

### Symmetric Memberlist

_Memberlist_ supports [symmetric key
encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys to allow for
rolling updates. Securing the cluster traffic via symmetric encryption would
involve only small configuration changes in the Alertmanager code base.


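
As an illustration of the key material such a setup would need, the sketch below generates a random key and checks that it is a valid AES key. AES-128, AES-192 and AES-256 correspond to key lengths of 16, 24 and 32 bytes. The helper `newGossipKey` is hypothetical, and the sketch uses only the Go standard library; it does not call _Memberlist_ and leaves out how the key would be handed to it.

```go
package main

import (
	"crypto/aes"
	"crypto/rand"
	"fmt"
)

// newGossipKey returns a random key of the given size in bytes. The helper
// name is hypothetical. AES accepts 16, 24 or 32 byte keys, which select
// AES-128, AES-192 or AES-256 respectively.
func newGossipKey(size int) ([]byte, error) {
	key := make([]byte, size)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	// aes.NewCipher rejects every key length other than 16, 24 or 32 bytes.
	if _, err := aes.NewCipher(key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	for _, size := range []int{16, 24, 32, 20} {
		key, err := newGossipKey(size)
		if err != nil {
			fmt.Printf("%d bytes: rejected (%v)\n", size, err)
			continue
		}
		fmt.Printf("%d bytes: valid AES-%d key\n", size, len(key)*8)
	}
}
```

With a shared key, every instance that holds the key is trusted equally, which is one way this alternative differs from the per-instance certificates of the mutual TLS proposal.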
### Replace Memberlist

Coordinating membership might not be required by the Alertmanager cluster
component. Instead, membership could be bound to static configuration or, for
example, DNS service discovery. On the other hand, gossiping silences and
notifications is ideally done in an eventually consistent gossip fashion, given
that Alertmanager is supposed to scale beyond a 3-instance cluster and beyond
local-area-network deployments. With these requirements in mind, replacing
_Memberlist_ with an entirely self-built communication layer would be a major
undertaking.


110
111### TLS Memberlist with DTLS
112
113Instead of redirecting all best-effort traffic via the reliable channel as
114proposed above, one could also secure the best-effort channel itself using UDP
115and [DTLS](https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
116addition to securing the reliable traffic via TCP and TLS. DTLS is not supported
117by the Golang standard library.