
Source file src/github.com/datawire/ambassador/v2/pkg/agent/doc.go

Documentation: github.com/datawire/ambassador/v2/pkg/agent

// Datawire-internal note: I've written these docs from the perspective that we
// intend to move this code into Ambassador OSS in the (near) future. Datawire
// folks also have access to the saas_app repository, which contains the
// implementation of the AgentCom and Director. External folks do not, so I've
// glossed over some of the details. Please improve these docs, but keep that
// constraint in mind. -- Abhay (@ark3)

/*
Package agent implements the Agent component in Ambassador.

The Agent is responsible for communicating with a cloud service run by Datawire.
It was introduced in AES 1.7.0. Ultimately, the goal is to be able to present a
cloud-based UI similar to the Edge Policy Console, but for now we just want to
display some information about what this AES knows is running in the cluster.

# Implementation Goals

Minimal impact when disabled. The Agent is optional. If the user does not turn
on the Agent, the associated code should do very little, thereby having almost
no impact on the rest of Ambassador. This is no different from any other opt-in
feature in Ambassador.

Tolerant of external factors. When the Agent is enabled, it talks to a cloud
service run by Datawire. This means that things outside the user’s cluster,
which have nothing to do with the user’s usage of Ambassador, can affect or
even break that installation: Datawire could make a mistake, or there could be
an outage of infrastructure beyond Datawire’s control. The point is, things
outside the cluster that were not chosen by the user have now become possible
sources of failure for the user’s Ambassador installation. The Agent must be
robust enough to avoid precipitating such failures.

This differs from other opt-in features because the external factors that can
break Ambassador are introduced not by the user but by Datawire.

# Overview

Datawire runs a microservice called AgentCom that implements the Director gRPC
service. The client for that service is the Agent; it runs in the user’s
Ambassador. To enable the Agent, the user must add configuration in the
Ambassador module, including the Agent’s account ID, which the user must obtain
from the online application.

If the Agent is enabled, it sends a snapshot of the current state of the cluster
to the Director on startup and whenever things change. This is done via the
Director’s Report method. At present, the report snapshot includes identity
information about this Ambassador and a small amount of information about each
Kubernetes Service known to this Ambassador.

The Agent also pulls directives from the Director and executes them. This is
done via the Director’s Retrieve method, which establishes a gRPC stream of
Directive messages flowing from the Director to the Agent.

Each Directive includes some flow control information (to tell the Agent to stop
sending reports or send them less frequently) and a list of commands for the
Agent to execute. In the future, these commands will be the mechanism to allow
the cloud UI to configure Ambassador and the cluster on behalf of the user. For
now, aside from flow control, the only command implemented is to log a short
string to Ambassador's log.

# Design layers

* Protocol Buffers for data

Messages between the Agent and the Director are implemented using Protocol
Buffers (Proto3). Protobuf presents a straightforward story around forward and
backward compatibility. Both endpoints need to be written with the following in
mind: every field in a message is optional; unrecognized fields are ignored.

This makes it possible to add or remove (really, stop using) fields. If you add
a field, old code simply ignores it when decoding and leaves it unset when
encoding. If you stop using a field, old code will keep setting it when encoding
and see that it is unset when decoding. New code must account for old code
behaving that way, but does not otherwise need to consider message versioning
explicitly.

Of course, not every field can really be optional. For example, a report from
the Agent is syntactically valid without an account ID, but it is not
semantically meaningful. It is up to the software at the endpoints to report an
error when a message is invalid.

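As a sketch of these compatibility rules, a message might evolve like this over
a few revisions. This fragment is purely illustrative; none of the field names
or numbers come from the real AgentCom schema:

```proto
syntax = "proto3";

// Hypothetical sketch only; not the real AgentCom schema.
message Report {
  // Present from the first revision. Still "optional" on the wire: a
  // decoder must cope with it being unset, even if the endpoint then
  // rejects the message as semantically invalid.
  string account_id = 1;

  // Field 2 was retired. Reserving the tag keeps a future field from
  // reusing it with a different meaning; old encoders that still set it
  // are simply ignored by new decoders.
  reserved 2;

  // Added in a later revision. Old decoders skip it as an unrecognized
  // field; old encoders leave it unset.
  string ambassador_version = 3;
}
```
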
* gRPC for communication

By using gRPC for the communication protocol between the Agent and the Director,
we gain a number of well-tested features for relatively low cost. gRPC is built
on HTTP/2, which is generally well-supported in locked-down environments and
works well with Envoy.

Generated code and the associated library together enable type-safe RPCs from
the Agent to the Director, offering a simple interface for handling
serialization, streaming messages to avoid polling, connection multiplexing,
automatic retries with exponential backoff, and TLS. The generated API is
straightforward, imperative, blocking code, even though there is a lot of
machinery running concurrently under the hood to make this fast and responsive.
As gRPC is built on top of Protocol Buffers, it has standard error types for
Proto-specific cases, such as semantically invalid messages, in addition to
types for typical RPC errors.

* Simple communication layer

There is a small set of Go code that uses the generated gRPC methods. The
RPCComm Go structure encapsulates the gRPC client state, including its Go
context, and tracks the Goroutine required to handle streaming responses from
the Retrieve call. Once it has been created, the RPCComm communicates with the
rest of the code via Go channels. RPCComm has a wrapper around the Report method
that makes sure the Retrieve call is running.

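A minimal sketch of that shape, with stand-in types instead of the generated
gRPC client (all names here are illustrative, not the real RPCComm API):

```go
package main

import (
	"fmt"
	"sync"
)

// comm stands in for RPCComm: it owns the client state and exposes the
// results of the streaming Retrieve call as a plain Go channel.
type comm struct {
	mu         sync.Mutex
	retrieving bool
	directives chan string // stands in for the stream of Directive messages
}

// ensureRetrieve lazily starts the goroutine that would consume the
// Retrieve stream; the real code reads gRPC messages in this goroutine.
func (c *comm) ensureRetrieve() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.retrieving {
		return
	}
	c.retrieving = true
	go func() { c.directives <- "log: hello from the Director" }()
}

// report wraps the Report RPC, first making sure Retrieve is running,
// mirroring the wrapper described above.
func (c *comm) report(r string) string {
	c.ensureRetrieve()
	return "sent " + r
}

func main() {
	c := &comm{directives: make(chan string, 1)}
	fmt.Println(c.report("report-1"))
	fmt.Println(<-c.directives)
}
```
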
* Reporting layer

The main Agent code has to do several things, and thus is somewhat complicated.
However, it is written in an event-driven manner, and nearly every computation
it performs is contained in a separate function that can be tested
independently. Note that actual test coverage is very thin for now.

The main loop blocks on Go channels listening for events. When it wakes up, it
handles the received event, reports to the Director if appropriate, and loops.

The Agent decides to send a report if it is configured to do so, reporting has
not been stopped by the Director, new information is available to send, and no
prior Report RPC is running. It performs the RPC in a separate single-shot
Goroutine to avoid blocking the loop. That Goroutine performs the RPC, sleeps
for a period, and then sends the result of the RPC over a channel as an event to
the main loop.

The code will not launch multiple RPCs (or Goroutines); it insists on each RPC
finishing before launching a new one. There is no queue of pending reports; the
loop only remembers the most recent report. An RPC error or timeout does not end
the loop; the error is logged and the loop continues. The RPC Goroutine sleeps
after performing the RPC to provide a simple, adjustable rate limit to
reporting. The loop receives the RPC result as an event; that is its indication
that the RPC is done.

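The decision logic above can be sketched without any of the real gRPC plumbing.
All names here are invented for illustration:

```go
package main

import "fmt"

// agent keeps only the most recent pending report and tracks whether a
// single-shot RPC goroutine is currently in flight.
type agent struct {
	pending     string
	rpcInFlight bool
}

// maybeReport mirrors the decision described above: send only if a report
// is pending and no prior Report RPC is still running. The send function
// stands in for launching the single-shot goroutine (RPC, then sleep).
func (a *agent) maybeReport(send func(string)) {
	if a.pending == "" || a.rpcInFlight {
		return
	}
	r := a.pending
	a.pending = ""
	a.rpcInFlight = true
	send(r)
}

// onRPCResult is the event the RPC goroutine posts back after its sleep;
// it is the loop's only indication that the RPC is done.
func (a *agent) onRPCResult(err error) {
	a.rpcInFlight = false // errors are logged and the loop continues
}

func main() {
	var sent []string
	a := &agent{pending: "report-1"}
	a.maybeReport(func(r string) { sent = append(sent, r) })
	a.pending = "report-2"                                   // arrives while the RPC is in flight...
	a.pending = "report-3"                                   // ...and is replaced: no queue, newest only
	a.maybeReport(func(r string) { sent = append(sent, r) }) // no-op: RPC still running
	a.onRPCResult(nil)
	a.maybeReport(func(r string) { sent = append(sent, r) })
	fmt.Println(sent) // only report-1 and report-3 are ever sent
}
```
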
The loop also receives directives as events. The directive is executed right
away in the same Goroutine, so commands must be fast/non-blocking for now. As
the only command available is to log a simple string, this is not a problem.
Directives can also include a flag to tell the Agent to stop reporting and a
duration to modify the reporting rate.

Finally, the loop receives new Watt snapshots as events. It uses the snapshot,
which includes everything this Ambassador knows about the cluster, to generate a
new report. If the new report is different from the last report that was sent,
the Agent stores the new report as the next one to be sent. The snapshot also
includes the information needed to determine whether the user has enabled the
Agent (in the Ambassador Module). So the Agent must receive and process
snapshots, even if all it discovers is that it is not enabled and doesn’t need
to do anything else.

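Snapshot processing can be sketched as a pure function, which is also how the
"Agent is disabled" case below falls out naturally. The types and names here
are stand-ins, not the real snapshot structures:

```go
package main

import "fmt"

// snapshot is a stand-in for a Watt snapshot; the real one carries the
// Ambassador Module and everything else Ambassador knows about the cluster.
type snapshot struct {
	agentEnabled bool
	accountID    string
	services     []string
}

// nextReport models the processing above: no identity means no report, and
// a report identical to the last one sent is not stored again.
func nextReport(s snapshot, lastSent string) (string, bool) {
	if !s.agentEnabled || s.accountID == "" {
		return "", false // no identity: short-circuit, never connect
	}
	report := fmt.Sprintf("%s %v", s.accountID, s.services)
	if report == lastSent {
		return "", false // nothing new to report
	}
	return report, true
}

func main() {
	r, ok := nextReport(snapshot{true, "acct-1", []string{"svc-a"}}, "")
	fmt.Println(ok, r)
	_, ok = nextReport(snapshot{false, "", []string{"svc-a"}}, "")
	fmt.Println(ok) // disabled: no report is ever generated
}
```
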
Connectivity to the Director is handled by the communication layer described
above. The RPCComm instance is first created when the Agent decides to report.
If the Agent never needs to report, e.g., because it is not enabled, then the
RPCComm is never created and no connection is ever made to the AgentCom and
Director. During snapshot processing, the Agent may discover that the Ambassador
Module has changed. In that case, the current RPCComm (if it exists) is closed
and discarded so that a new one can be created when needed.

* Snapshot layer

AES has a simple publish/subscribe mechanism for distributing Watt snapshots
throughout the Amb-Sidecar code. It pushes snapshots to subscribers as they come
in from Watt, discarding and replacing stale snapshots if they are not consumed.
As a result, if the Agent is unable to keep up with incoming snapshots, other
AES components will not be blocked or otherwise affected and there will be no
backlog. This mechanism has existed for a while; I’m only mentioning it because
this is the only non-Agent source for events into the Reporting layer.

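The discard-and-replace behavior described above is commonly built from a
size-one buffered channel; a minimal sketch of the publish side (the real AES
mechanism may differ in detail):

```go
package main

import "fmt"

// publish delivers a snapshot to a subscriber's size-one buffered channel,
// dropping the stale value if the subscriber has not consumed it yet, so a
// slow subscriber never blocks the publisher and never builds a backlog.
func publish(ch chan string, snap string) {
	for {
		select {
		case ch <- snap:
			return
		default:
			// Subscriber is behind: drain the stale snapshot and retry.
			select {
			case <-ch:
			default:
			}
		}
	}
}

func main() {
	sub := make(chan string, 1)
	publish(sub, "snapshot-1")
	publish(sub, "snapshot-2") // snapshot-1 is discarded, never queued
	fmt.Println(<-sub)
}
```
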
# Communication

Reporting and retrieving operations share an identity message that includes the
account ID, which is how the cloud app identifies this particular Ambassador,
and the version of Ambassador, just in case we want to send different commands
to different versions. It also includes other information that does not affect
the behavior of the Agent or the Director.

The identity message is constructed from the Ambassador Module received in the
Watt snapshot (accounting for the Ambassador ID for this Ambassador). This code
cannot return an identity if the Agent is not enabled. The lack of an identity
short-circuits further evaluation of the snapshot, which means no report is
generated, no reporting happens, and no connection is initiated to the Director.

Reports to the Director also include a list of Service messages, which are
essentially stripped-down copies of Kubernetes Service manifests. The message
includes the name, namespace, and labels of the service, as well as the subset
of the annotations that have keys starting with app.getambassador.io.

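The stripping step can be sketched like so; the struct and function names are
illustrative, not the real Service message or its builder:

```go
package main

import (
	"fmt"
	"strings"
)

// serviceReport is an illustrative stand-in for the Service message: the
// stripped-down fields the report actually carries.
type serviceReport struct {
	name, namespace string
	labels          map[string]string
	annotations     map[string]string
}

// stripService keeps only the annotation keys with the documented prefix.
func stripService(name, ns string, labels, ann map[string]string) serviceReport {
	kept := map[string]string{}
	for k, v := range ann {
		if strings.HasPrefix(k, "app.getambassador.io") {
			kept[k] = v
		}
	}
	return serviceReport{name: name, namespace: ns, labels: labels, annotations: kept}
}

func main() {
	s := stripService("example", "default",
		map[string]string{"app": "example"},
		map[string]string{
			"app.getambassador.io/owner": "team-a",
			"some.other.io/config":       "dropped",
		})
	fmt.Println(len(s.annotations), s.annotations["app.getambassador.io/owner"])
}
```
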
The Agent retrieves and executes directives. Each directive includes a list of
commands. We could stream commands individually, but doing so in batches allows
for basic all-or-nothing communication. Each directive can also have two flow
control fields to allow the Director to adjust the Agent’s rate of reporting or
turn it off entirely. This allows the Director to force some or all Agents to
slow down their rate of reporting if the cloud service is overwhelmed. The
minimum report period is implemented on the Agent side by sleeping in the RPC
Goroutine after the RPC completes; the Agent won’t launch a new RPC until that
Goroutine finishes and returns a result.

# Interesting Cases

* Agent is disabled

When the Agent processes a snapshot, the first thing it does is attempt to
construct an identity, which requires pulling the account ID from the Ambassador
module. At this point, if the Agent is not enabled or the account ID is not
specified, the code will not construct an identity. This short-circuits the rest
of snapshot processing, which means a new report cannot be generated, and so no
reporting is performed.

If the Agent is disabled right at startup, the above flow will happen with the
very first snapshot. Because a report is never generated, the Agent will not
even attempt to connect to the Director.

If the Agent is disabled sometime after startup, the above flow will cause no
further reports to be generated. An existing connection to the Director will
persist, but if that connection drops, the Agent will not connect again.

* Heavy load

The CPU and memory load the Agent can generate is limited by the size of the
Watt snapshot, specifically the number of Kubernetes Services. The Agent
effectively makes a very shallow copy of the Services in the snapshot, mostly
copying references and pointers. If the Agent decides to report, the generated
Protobuf/gRPC code must construct a serialized blob of bytes to send, which does
end up copying all the strings byte-by-byte, but that blob is short-lived. Other
than snapshot processing and reporting, the Agent’s workload is very brief and
does very little allocation.

Different components can fall behind under heavy CPU load (from the Agent, or
from other AES components). The reporting layer can fail to process Watt
snapshots as fast as they come in. The communications layer can fail to
serialize/deserialize reports as fast as they come in. If the network is slow,
then the communication layer could fall behind due to slow RPCs. This is all
okay, because none of the layers queues up a backlog or tries to do additional
work concurrently. Instead, each layer preserves only the most recent result and
eventually processes that result, or a subsequent one, in a serial manner.

* Slow or broken network

If the network is consistently slow (always, or for a stretch of time), some
layers may fall behind, and that is okay, as described above. If the network is
inconsistent, the Agent relies on the gRPC library's error reporting. The Agent
reacts to all errors in the same way: log the error and try again later. In all
cases, that later time is the next time the Agent decides to report.

# Evolving the project

Users may run a given release of Ambassador for a very long time after future
versions have been released. Datawire may add new features to the AgentCom side
of things in the cloud app, or even roll back to older versions as the need
arises. Datawire may also choose to turn off the AgentCom side entirely.

This implementation of the Agent can handle those situations, so if a user
decides to run this release for a long time and leave the Agent enabled, they
should have no trouble regardless of what Datawire does with its cloud service.
If the AgentCom disappears entirely, or the Director loses its current gRPC
endpoints for some reason, this Agent’s communication layer will log errors but
will otherwise continue to function just fine. A future version of the Director
can choose to reject reports from this Agent, but that won’t cause any trouble
with this Ambassador. A future version of the Director can send commands that
this Agent doesn’t understand; it will simply ignore them thanks to the basic
compatibility properties of Protocol Buffers. Similarly, future versions of the
Agent can remain compatible with older versions of the Director.

The current design of the Agent does not take into consideration the fact that
multiple Ambassador Pods are likely to be running simultaneously. Every replica
runs an Agent that reports to the Director; it is the Director's responsibility
to de-duplicate reports as needed. Similarly, every replica executes all
directives retrieved. It is safe to do so in the current trivial implementation,
but adding commands that modify the cluster state will require considering how
to keep Agents from stepping on each other.
*/
package agent
