// Datawire-internal note: I've written these docs from the perspective that we
// intend to move this code into Ambassador OSS in the (near) future. Datawire
// folks also have access to the saas_app repository, which contains the
// implementation of the AgentCom and Director. External folks do not, so I've
// glossed over some of the details. Please improve these docs, but keep that
// constraint in mind. -- Abhay (@ark3)

/*
Package agent implements the Agent component in Ambassador.

The Agent is responsible for communicating with a cloud service run by Datawire.
It was introduced in AES 1.7.0. Ultimately, the goal is to be able to present a
cloud-based UI similar to the Edge Policy Console, but for now we just want to
display some information about what this AES knows is running in the cluster.

# Implementation Goals

Minimal impact when disabled. The Agent is optional. If the user does not turn
on the Agent, the associated code should do very little, thereby having almost
no impact on the rest of Ambassador. This is no different from any other opt-in
feature in Ambassador.

Tolerant of external factors. When the Agent is enabled, it talks to a cloud
service run by Datawire. This means it is possible for things outside the user’s
cluster, that have nothing to do with the user’s usage of Ambassador, to affect
or even break that installation. Datawire could make a mistake, or there could
be an outage of their infrastructure outside of their control, or... The point
is, things outside the cluster that are not chosen by the user have now become
possible sources of failure for the user’s Ambassador installation. The Agent
must be robust enough to avoid precipitating such failures.

This is different from other opt-in features, because there is the potential for
external factors to break Ambassador that were not introduced by the user, but
rather by Datawire.

# Overview

Datawire runs a microservice called AgentCom that implements the Director gRPC
service. The client for that service is the Agent; it runs in the user’s
Ambassador. To enable the Agent, the user must add configuration in the
Ambassador module, including the Agent’s account ID, which the user must obtain
from the online application.

If the Agent is enabled, it sends a snapshot of the current state of the cluster
to the Director on startup and whenever things change. This is done via the
Director’s Report method. At present, the report snapshot includes identity
information about this Ambassador and a small amount of information about each
Kubernetes Service known to this Ambassador.

The Agent also pulls directives from the Director and executes them. This is
done via the Director’s Retrieve method, which establishes a gRPC stream of
Directive messages flowing from the Director to the Agent.

Each Directive includes some flow control information (to tell the Agent to stop
sending reports or send them less frequently) and a list of commands for the
Agent to execute. In the future, these commands will be the mechanism to allow
the cloud UI to configure Ambassador and the cluster on behalf of the user. For
now, aside from flow control, the only command implemented is to log a short
string to Ambassador's log.

# Design layers

* Protocol Buffers for data

Messages between the Agent and the Director are implemented using Protocol
Buffers (Proto3). Protobuf presents a straightforward story around forward and
backward compatibility. Both endpoints need to be written with the following in
mind: every field in a message is optional; unrecognized fields are ignored.

This makes it possible to add or remove (really, stop using) fields. If you add
a field, old code simply ignores it when decoding and leaves it unset when
encoding.
If you stop using a field, old code will keep setting it when encoding
and see that it is unset when decoding. New code must account for old code
behaving that way, but does not otherwise need to consider message versioning
explicitly.

Of course, not every field can really be optional. For example, a report from
the Agent is syntactically valid without an account ID, but it is not
semantically meaningful. It is up to the software at the endpoints to report an
error when a message is invalid.

* gRPC for communication

By using gRPC for the communication protocol between the Agent and the Director,
we gain a number of well-tested features for relatively low cost. gRPC is built
on HTTP/2, which is generally well-supported in locked-down environments and
works well with Envoy.

Generated code and the associated library together enable type-safe RPCs from
the Agent to the Director, offering a simple interface for handling
serialization, streaming messages to avoid polling, connection multiplexing,
automatic retries with exponential backoff, and TLS. The generated API is
straightforward, imperative, blocking code even though there is a lot of
machinery running concurrently under the hood to make this fast and responsive.
As gRPC is built on top of Protocol Buffers, it has standard error types for
Protobuf-specific cases such as semantically invalid messages in addition to
types for typical RPC errors.

* Simple communication layer

There is a small amount of Go code that uses the generated gRPC methods. The
RPCComm Go structure encapsulates the gRPC client state, including its Go
context, and tracks the Goroutine required to handle streaming responses from
the Retrieve call. Once it has been created, the RPCComm communicates with the
rest of the code via Go channels. RPCComm has a wrapper around the Report method
that makes sure the Retrieve call is running.
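
To make the shape of this layer concrete, here is a minimal sketch of the
pattern, not the real RPCComm. It assumes the generated Report and Retrieve
methods can be stood in for by plain functions, and every name in it (comm,
reportFunc, retrieveFunc, demoReport, demoRetrieve) is hypothetical. It shows a
Report wrapper that makes sure the retriever Goroutine is running and a channel
that hands directives to the rest of the code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// directive stands in for the generated Directive message (hypothetical).
type directive struct{ ID string }

// reportFunc and retrieveFunc stand in for the generated Report and Retrieve
// client methods. retrieveFunc delivers directives until the stream ends.
type (
	reportFunc   func(ctx context.Context, report string) error
	retrieveFunc func(ctx context.Context, out chan<- directive) error
)

// comm owns the client state: its context, the retriever Goroutine, and the
// channel that hands directives to the rest of the code.
type comm struct {
	ctx        context.Context
	cancel     context.CancelFunc
	report     reportFunc
	retrieve   retrieveFunc
	directives chan directive

	mu         sync.Mutex
	retrieving bool
}

func newComm(rep reportFunc, ret retrieveFunc) *comm {
	ctx, cancel := context.WithCancel(context.Background())
	return &comm{ctx: ctx, cancel: cancel, report: rep, retrieve: ret,
		directives: make(chan directive, 1)}
}

// ensureRetrieving starts the retriever Goroutine unless one is running.
func (c *comm) ensureRetrieving() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.retrieving {
		return
	}
	c.retrieving = true
	go func() {
		// A real client would log this error; the next Report restarts the call.
		_ = c.retrieve(c.ctx, c.directives)
		c.mu.Lock()
		c.retrieving = false
		c.mu.Unlock()
	}()
}

// Report makes sure the Retrieve call is running, then performs the RPC.
func (c *comm) Report(report string) error {
	c.ensureRetrieving()
	return c.report(c.ctx, report)
}

// Close cancels the context, which ends any RPC in progress.
func (c *comm) Close() { c.cancel() }

// demoReport succeeds until the context is cancelled.
func demoReport(ctx context.Context, _ string) error { return ctx.Err() }

// demoRetrieve delivers one directive, then idles until the context ends.
func demoRetrieve(ctx context.Context, out chan<- directive) error {
	select {
	case out <- directive{ID: "hello"}:
		<-ctx.Done()
	case <-ctx.Done():
	}
	return ctx.Err()
}

func main() {
	c := newComm(demoReport, demoRetrieve)
	defer c.Close()
	if err := c.Report("report-1"); err != nil {
		fmt.Println("report failed:", err)
	}
	fmt.Println("directive:", (<-c.directives).ID) // prints "directive: hello"
}
```

In this sketch, closing the comm cancels its context, so a later Report fails
rather than reconnecting; the caller is expected to discard the comm and create
a fresh one when it next needs to report.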

* Reporting layer

The main Agent code has to do several things, and thus is somewhat complicated.
However, it is written in an event-driven manner, and nearly every computation
it performs is contained in a separate function that can be tested
independently. Note that actual test coverage is very thin for now.

The main loop blocks on Go channels listening for events. When it wakes up, it
handles the received event, reports to the Director if appropriate, and loops.

The Agent decides to send a report if it is configured to do so, reporting has
not been stopped by the Director, new information is available to send, and no
prior Report RPC is running. It performs the RPC in a separate single-shot
Goroutine to avoid blocking the loop. That Goroutine performs the RPC, sleeps
for a period, and then sends the result of the RPC over a channel as an event to
the main loop.

The code will not launch multiple RPCs (or Goroutines); it insists on each RPC
finishing before launching a new one. There is no queue of pending reports; the
loop only remembers the most recent report. An RPC error or timeout does not end
the loop; the error is logged and the loop continues. The RPC Goroutine sleeps
after performing the RPC to provide a simple, adjustable rate limit to
reporting. The loop receives the RPC result as an event; that is its indication
that the RPC is done.

The loop also receives directives as events. The directive is executed right
away in the same Goroutine, so commands must be fast/non-blocking for now. As
the only command available is to log a simple string, this is not a problem.
Directives can also include a flag to tell the Agent to stop reporting and a
duration to modify the reporting rate.

Finally, the loop receives new Watt snapshots as events.
It uses the snapshot,
which includes everything this Ambassador knows about the cluster, to generate a
new report. If the new report is different from the last report that was sent,
the Agent stores the new report as the next one to be sent. The snapshot also
includes the information needed to determine whether the user has enabled the
Agent (in the Ambassador Module). So the Agent must receive and process
snapshots, even if all it discovers is that it is not enabled and doesn’t need
to do anything else.

Connectivity to the Director is handled by the communication layer described
above. The RPCComm instance is first created when the Agent decides to report.
If the Agent never needs to report, e.g., because it is not enabled, then the
RPCComm is never created and no connection is ever made to the AgentCom and
Director. During snapshot processing, the Agent may discover that the Ambassador
Module has changed. In that case, the current RPCComm (if it exists) is closed
and discarded so that a new one can be created when needed.

* Snapshot layer

AES has a simple publish/subscribe mechanism for distributing Watt snapshots
throughout the Amb-Sidecar code. It pushes snapshots to subscribers as they come
in from Watt, discarding and replacing stale snapshots if they are not consumed.
As a result, if the Agent is unable to keep up with incoming snapshots, other
AES components will not be blocked or otherwise affected and there will be no
backlog. This mechanism has existed for a while; I’m only mentioning it because
this is the only non-Agent source for events into the Reporting layer.
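
The discard-and-replace behavior described above can be sketched with a channel
of capacity one. This is an illustration under stated assumptions, not the
actual AES publish/subscribe code: the latest type and its methods are
hypothetical, snapshots are reduced to strings, and there is a single publisher
and a single subscriber.

```go
package main

import "fmt"

// latest delivers values to one subscriber and keeps only the newest one.
// Hypothetical type; it assumes a single publisher Goroutine.
type latest struct{ ch chan string }

func newLatest() *latest { return &latest{ch: make(chan string, 1)} }

// Publish never blocks. If the previous value has not been consumed yet, it
// is discarded and replaced by the new one.
func (l *latest) Publish(v string) {
	for {
		select {
		case l.ch <- v:
			return // the slot was free
		default:
		}
		select {
		case <-l.ch: // drop the stale value, then retry the send
		default: // the subscriber took it first; retry the send
		}
	}
}

// Updates is the channel the subscriber's event loop selects on.
func (l *latest) Updates() <-chan string { return l.ch }

func main() {
	l := newLatest()
	l.Publish("snapshot-1")
	l.Publish("snapshot-2") // replaces snapshot-1, which nobody consumed
	l.Publish("snapshot-3") // replaces snapshot-2
	fmt.Println(<-l.Updates()) // prints "snapshot-3"
}
```

Because the publisher never waits for the subscriber, a slow subscriber only
ever costs it one channel slot, which is the property the paragraph above
relies on.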

# Communication

Reporting and retrieving operations share an identity message that includes the
account ID, which is how the cloud app identifies this particular Ambassador,
and the version of Ambassador, just in case we want to send different commands
to different versions. It also includes other information that does not affect
the behavior of the Agent or the Director.

The identity message is constructed from the Ambassador Module received in the
Watt snapshot (accounting for the Ambassador ID for this Ambassador). This code
cannot return an identity if the Agent is not enabled. The lack of an identity
short-circuits further evaluation of the snapshot, which means no report is
generated, no reporting happens, and no connection is initiated to the Director.

Reports to the Director also include a list of Service messages, which are
essentially stripped-down copies of Kubernetes Service manifests. Each message
includes the name, namespace, and labels of the service, as well as the subset
of the annotations that have keys starting with app.getambassador.io.

The Agent retrieves and executes directives. Each directive includes a list of
commands. We could stream commands individually, but doing so in batches allows
for basic all-or-nothing communication. Each directive can also have two flow
control fields to allow the Director to adjust the Agent’s rate of reporting or
turn it off entirely. This allows the Director to force some or all Agents to
slow down their rate of reporting if the cloud service is overwhelmed. The
minimum report period is implemented on the Agent side by sleeping in the RPC
Goroutine after the RPC completes; the Agent won’t launch a new RPC until that
Goroutine finishes and returns a result.
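
The reporting decision and the rate limit described above can be sketched as
follows. This is a simplified illustration, not the Agent's actual code: the
state struct, its field names, shouldReport, and reportOnce are all
hypothetical, and the Report RPC is replaced by a plain function.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// state holds what the loop needs to decide whether to report.
// All field names are hypothetical.
type state struct {
	enabled          bool   // the Agent is configured to report
	reportingStopped bool   // the Director asked the Agent to stop
	pending          string // next report to send; "" means nothing new
	rpcRunning       bool   // a Report RPC Goroutine is still in flight
}

// shouldReport is true only when all four conditions hold.
func shouldReport(s state) bool {
	return s.enabled && !s.reportingStopped && s.pending != "" && !s.rpcRunning
}

// reportOnce runs in its own Goroutine: perform the RPC, sleep for the
// minimum report period, then deliver the RPC result to the loop as an event.
func reportOnce(rpc func() error, minPeriod time.Duration, results chan<- error) {
	err := rpc()
	time.Sleep(minPeriod)
	results <- err
}

func main() {
	results := make(chan error)
	s := state{enabled: true, pending: "report-1"}
	rpc := func() error { return errors.New("director unavailable") }

	if shouldReport(s) {
		s.rpcRunning, s.pending = true, ""
		go reportOnce(rpc, 10*time.Millisecond, results)
	}
	// While the Goroutine is in flight, nothing else may be launched.
	fmt.Println("may launch another RPC:", shouldReport(s))

	// The result arrives as an event; an error is logged, not fatal.
	err := <-results
	s.rpcRunning = false
	fmt.Println("logged and continuing:", err)
}
```

Sleeping inside the RPC Goroutine rather than in the loop keeps the loop free
to handle snapshots and directives during the wait, while rpcRunning still
blocks a second report until the period has passed.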

# Interesting Cases

* Agent is disabled

When the Agent processes a snapshot, the first thing it does is attempt to
construct an identity, which requires pulling the account ID from the Ambassador
module. At this point, if the Agent is not enabled or the account ID is not
specified, the code will not construct an identity. This short-circuits the
rest of snapshot processing, which means a new report cannot be generated, and
so no reporting is performed.

If the Agent is disabled right at startup, the above flow will happen with the
very first snapshot. Because a report is never generated, the Agent will not
even attempt to connect to the Director.

If the Agent is disabled sometime after startup, the above flow will cause no
further reports to be generated. An existing connection to the Director will
persist, but if that connection drops, the Agent will not connect again.

* Heavy load

The CPU and memory load the Agent can generate is limited by the size of the
Watt snapshot, specifically the number of Kubernetes Services. The Agent
effectively makes a very shallow copy of the Services in the snapshot, mostly
copying references and pointers. If the Agent decides to report, the generated
Protobuf/gRPC code must construct a serialized blob of bytes to send, which does
end up copying all the strings byte-by-byte, but that blob is short-lived. Other
than snapshot processing and reporting, the Agent’s workload is very brief and
does very little allocation.

Different components can fall behind under heavy CPU load (from the Agent, or
from other AES components). The reporting layer can fail to process Watt
snapshots as fast as they come in. The communication layer can fail to
serialize/deserialize reports as fast as they come in. If the network is slow,
then the communication layer could fall behind due to slow RPCs.
This is all okay, because none of the layers queues up a backlog or tries to do
additional work concurrently. Instead, each layer preserves only the most recent
result and eventually processes that result, or a subsequent one, in a serial
manner.

* Slow or broken network

If the network is consistently slow (always, or for a stretch of time), some
layers may fall behind, and that is okay, as described above. If the network is
inconsistent, the Agent relies on the gRPC library's error reporting. The Agent
reacts to all errors in the same way: log the error and try again later. In all
cases, that later time is the next time the Agent decides to report.

# Evolving the project

Users may run a given release of Ambassador for a very long time after future
versions have been released. Datawire may add new features to the AgentCom side
of things in the cloud app, or even roll back to older versions as the need
arises. Datawire may also choose to turn off the AgentCom side entirely.

This implementation of the Agent can handle those situations, so if a user
decides to run this release for a long time and leave the Agent enabled, they
should have no trouble regardless of what Datawire does with its cloud service.
If the AgentCom disappears entirely, or the Director loses its current gRPC
endpoints for some reason, this Agent’s communication layer will log errors but
will otherwise continue to function just fine. A future version of the Director
can choose to reject reports from this Agent, but that won’t cause any trouble
with this Ambassador. A future version of the Director can send commands that
this Agent doesn’t understand; it will simply ignore them thanks to the basic
compatibility properties of Protocol Buffers. Similarly, future versions of the
Agent can remain compatible with older versions of the Director.
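
As an illustration of that tolerance, here is a sketch of directive handling.
It is not the Agent's actual code: hand-written structs stand in for the
generated Protobuf types, and every type and field name is hypothetical. It
assumes that a command this Agent does not know about decodes with none of the
known fields set, so it can be skipped, that each directive's stop flag
replaces the previous one, and that a zero duration means "leave the reporting
period unchanged".

```go
package main

import (
	"fmt"
	"time"
)

// command and directive are stand-ins for the generated Protobuf types.
type command struct {
	Message string // the one known command: log this string
}

type directive struct {
	Commands        []command
	StopReporting   bool
	MinReportPeriod time.Duration // zero means "leave unchanged" (assumption)
}

// agent holds only the state that directives can change in this sketch.
type agent struct {
	reportingStopped bool
	minReportPeriod  time.Duration
	logged           []string // stands in for Ambassador's log
}

// handle applies flow control, then executes each command in order.
func (a *agent) handle(d directive) {
	a.reportingStopped = d.StopReporting
	if d.MinReportPeriod > 0 {
		a.minReportPeriod = d.MinReportPeriod
	}
	for _, c := range d.Commands {
		if c.Message == "" {
			// Nothing this Agent recognizes is set, for example a command
			// kind added by a newer Director. Skip it.
			continue
		}
		a.logged = append(a.logged, c.Message)
	}
}

func main() {
	a := &agent{minReportPeriod: 30 * time.Second}
	a.handle(directive{
		Commands:        []command{{Message: "hello from the Director"}, {}},
		MinReportPeriod: time.Minute,
	})
	fmt.Println(a.logged, a.reportingStopped, a.minReportPeriod)
}
```

The empty command in the example plays the role of a message from a newer
Director: it is skipped without an error, and the known command next to it is
still executed.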

The current design of the Agent does not take into consideration the fact that
multiple Ambassador Pods are likely to be running simultaneously. Every replica
runs an Agent that reports to the Director; it is the Director's responsibility
to de-duplicate reports as needed. Similarly, every replica executes all
directives retrieved. It is safe to do so in the current trivial implementation,
but adding commands that modify the cluster state will require considering how
to keep Agents from stepping on each other.
*/
package agent