# Operating a CT Log

Once a CT log is deployed it needs to be kept operational, particularly if it
is expected to be included in Chrome's
[list of trusted logs](http://www.certificate-transparency.org/known-logs).

Be **warned**: running a CT log is more difficult than running a normal
database-backed web site, because of the security properties required from a Log
– running a public Log involves a commitment to reliably store all (valid)
uploaded certificates and include them in the tree within a specified period.

This means that failures that would be recoverable for a normal website –
losing tiny amounts of logged data, accidentally re-using keys – will
result in the [failure](https://tools.ietf.org/html/rfc6962#section-7.3) of a CT
Log.

 - [Key Management](#key-management)
 - [Temporal Sharding](#temporal-sharding)
 - [Alerting](#alerting)
 - [Load Testing](#load-testing)
 - [Backups](#backups)
 - [Troubleshooting](#troubleshooting)
 - [Browser Submission](#browser-submission)


## Key Management

A CT Log is a cryptographic entity that signs data using a
[private key](https://tools.ietf.org/html/rfc6962#section-2.1.4).  This key is
needed by all of the distributed CTFE instances, but also needs to be kept
secure.  In particular:

 - The CT Log key must not be re-used for distinct Logs.
 - The CT Log key should not be re-used for HTTPS/TLS termination.

The corresponding public key is needed in order to register as a Log that is
[trusted by browsers](#browser-submission).
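
One way to catch accidental key re-use across Log instances before launch is
to compare Log IDs, which RFC 6962 defines as the SHA-256 hash of the Log's
public key.  The following is a minimal sketch rather than part of this
repository: the helper names (`logID`, `findReusedKeys`) and the Log names are
hypothetical, and the byte strings stand in for DER-encoded public keys that a
real check would load from its own configuration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// logID returns the RFC 6962 Log ID: the SHA-256 hash of the public key bytes.
func logID(publicKeyDER []byte) [sha256.Size]byte {
	return sha256.Sum256(publicKeyDER)
}

// findReusedKeys groups Log names by Log ID and returns every group that
// contains more than one name, i.e. every key shared between Logs.
func findReusedKeys(keys map[string][]byte) [][]string {
	byID := make(map[[sha256.Size]byte][]string)
	for name, der := range keys {
		id := logID(der)
		byID[id] = append(byID[id], name)
	}
	var reused [][]string
	for _, names := range byID {
		if len(names) > 1 {
			sort.Strings(names)
			reused = append(reused, names)
		}
	}
	// Sort the groups so that the output is deterministic.
	sort.Slice(reused, func(i, j int) bool { return reused[i][0] < reused[j][0] })
	return reused
}

func main() {
	// Placeholder bytes; real input would be DER-encoded public keys.
	keys := map[string][]byte{
		"shard2024": []byte("placeholder DER bytes A"),
		"shard2025": []byte("placeholder DER bytes B"),
		"testlog":   []byte("placeholder DER bytes A"), // deliberately re-uses a key
	}
	for _, names := range findReusedKeys(keys) {
		id := logID(keys[names[0]])
		fmt.Printf("key %s... is shared by %v\n", hex.EncodeToString(id[:4]), names)
	}
}
```

A check like this only covers re-use between Logs; confirming that the key is
not also used for HTTPS/TLS termination needs a separate comparison against
the serving certificates.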


## Temporal Sharding

To prevent unbounded growth of Log instances, it is recommended that a new
production Log is set up to be *temporally sharded*: a collection of separate
Log instances (each with its own private key) that each accept certificates
with a `NotAfter` date in a particular date range (usually a calendar year).

The [multi-tenant nature](ManualDeployment.md#tree-provisioning) of
Trillian-based Logs makes this straightforward to deploy; each shard just needs
to set the half-open [`not_after_start`, `not_after_limit`) range in the
[CTFE configuration files](ManualDeployment.md#ctfe-configuration).
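
To make the half-open range concrete, the sketch below models a set of
calendar-year shards and picks the shard that would accept a given `NotAfter`
value.  It is an illustration only: the `shard` type, the helper functions and
the shard names are hypothetical and are not the CTFE's own data structures.

```go
package main

import (
	"fmt"
	"time"
)

// shard is a hypothetical stand-in for one Log instance's NotAfter range.
type shard struct {
	name  string
	start time.Time // inclusive, corresponds to not_after_start
	limit time.Time // exclusive, corresponds to not_after_limit
}

// accepts reports whether notAfter falls in the half-open range [start, limit).
func (s shard) accepts(notAfter time.Time) bool {
	return !notAfter.Before(s.start) && notAfter.Before(s.limit)
}

// utcDate returns midnight UTC at the start of the given date.
func utcDate(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

// yearShards builds one shard per calendar year, from first to last inclusive.
func yearShards(first, last int) []shard {
	var shards []shard
	for y := first; y <= last; y++ {
		shards = append(shards, shard{
			name:  fmt.Sprintf("shard%d", y),
			start: utcDate(y, 1, 1),
			limit: utcDate(y+1, 1, 1),
		})
	}
	return shards
}

// shardFor returns the name of the shard that accepts notAfter, or "" if
// no shard covers that date.
func shardFor(shards []shard, notAfter time.Time) string {
	for _, s := range shards {
		if s.accepts(notAfter) {
			return s.name
		}
	}
	return ""
}

func main() {
	shards := yearShards(2024, 2026)
	dates := []time.Time{utcDate(2024, 12, 31), utcDate(2025, 1, 1), utcDate(2027, 6, 1)}
	for _, d := range dates {
		fmt.Printf("%s -> %q\n", d.Format("2006-01-02"), shardFor(shards, d))
	}
}
```

Because each limit is exclusive and equals the next shard's start, adjacent
shards neither overlap nor leave a gap at the year boundary.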


## Alerting

The deployment documents include a discussion of
[monitoring mechanisms](ManualDeployment.md#monitoring); for reliable
operation, this monitoring should be connected to an alerting system that gives
enough time for operations staff to respond to problems.

This alerting should cover normal operational metrics, such as:

 - Rates of errored requests, categorized according to:
    - read and write paths
    - client-side (4xx) errors and server-side (5xx) errors.
 - Latency distribution of requests.
 - Task health, CPU and memory usage.

However, the alerting should also cover criteria that are specific to running a
CT Log.  In particular, the Log issues signed promises to incorporate
submissions within a fixed time window (the maximum merge delay, or MMD), and
this incorporation relies on a single point of failure (the
[signer](ManualDeployment.md#primary-signer-election)).  As such, there are
some CT-specific metrics that can also be alerted on:

 - The age of the most recent Merkle tree head.
 - The size of the current backlog of unmerged submissions.
 - Per-log instance counts of primary signer instances (which should normally
   be 1, can transiently be 0, but should never be > 1).
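
The CT-specific conditions above can be sketched as a small evaluation
function.  Everything here is an assumption made for illustration: the struct
fields, the 24-hour MMD and the threshold values are hypothetical, and in
practice these rules would normally be expressed in the alerting system's own
rule language rather than in Go.

```go
package main

import (
	"fmt"
	"time"
)

// mmd is an assumed maximum merge delay; use the value the Log has committed to.
const mmd = 24 * time.Hour

// logStatus holds values that would come from the monitoring system.
type logStatus struct {
	sthAge          time.Duration // age of the most recent Merkle tree head
	backlog         int64         // number of unmerged submissions
	signerCount     int           // instances currently acting as primary signer
	signerAbsentFor time.Duration // how long signerCount has been 0
}

// thresholds are operator-chosen limits; the values below are illustrative.
type thresholds struct {
	maxSTHAge        time.Duration
	maxBacklog       int64
	maxSignerAbsence time.Duration
}

// exampleThresholds leaves most of the MMD as response time for operators.
var exampleThresholds = thresholds{
	maxSTHAge:        mmd / 4,
	maxBacklog:       100000,
	maxSignerAbsence: mmd / 24,
}

// alerts returns one message per CT-specific condition that is violated.
func alerts(s logStatus, t thresholds) []string {
	var out []string
	if s.sthAge > t.maxSTHAge {
		out = append(out, fmt.Sprintf("tree head is %v old (limit %v)", s.sthAge, t.maxSTHAge))
	}
	if s.backlog > t.maxBacklog {
		out = append(out, fmt.Sprintf("backlog of %d unmerged submissions (limit %d)", s.backlog, t.maxBacklog))
	}
	switch {
	case s.signerCount > 1:
		// More than one primary signer is never acceptable.
		out = append(out, fmt.Sprintf("%d primary signers active, expected at most 1", s.signerCount))
	case s.signerCount == 0 && s.signerAbsentFor > t.maxSignerAbsence:
		// Zero signers is tolerated briefly, e.g. during an election.
		out = append(out, fmt.Sprintf("no primary signer for %v", s.signerAbsentFor))
	}
	return out
}

func main() {
	status := logStatus{sthAge: 7 * time.Hour, backlog: 250000, signerCount: 2}
	for _, a := range alerts(status, exampleThresholds) {
		fmt.Println("ALERT:", a)
	}
}
```

Setting the tree-head age limit well below the MMD is the point of the
exercise: the alert has to fire while there is still time to fix the signer
before any promise to incorporate is broken.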


## Load Testing

The modern Web PKI operates at a much larger scale than it did just a couple of
years ago, and this increase in scale is only likely to accelerate (e.g. with a
shift towards shorter certificate expiration times).

This means that a live production Log needs to be able to cope with large
volumes of submissions, resulting in a tree size of hundreds of millions of
certificates (or more!).

To confirm that this scale is indeed supported, it's a good idea to run load
tests on a Log deployment before launch.  This is typically done in a parallel
test environment that is as close to the live environment as possible (being
careful not to [re-use keys](#key-management) between the test and live
environments).

This repository includes a couple of tools to help with this testing.  Firstly,
the
[`preloader` tool](https://github.com/google/certificate-transparency-go/blob/master/preload/preloader)
allows the contents of a source log to be copied into a destination log.  This
tool has command-line options to control its parallelism, but is fundamentally
a single-process executable.

The other load-testing tool is
[`ct_hammer`](https://github.com/google/certificate-transparency-go/blob/master/trillian/integration/ct_hammer),
which tests all of the
[RFC 6962 entrypoints](https://tools.ietf.org/html/rfc6962#section-4) with both
valid and invalid inputs.

 - For write-path testing, `ct_hammer` relies on the Log under test being
   configured to accept a test root certificate, so that synthetic test
   certificates can be submitted.
 - For convenience, `ct_hammer` accepts the same format of configuration file
   that is used to configure the CTFE.  (However, be careful not to distribute
   a CTFE configuration file that includes non-test
   [private keys](#key-management).)
 - The `--rate_limit` option controls the overall rate limit for the tool.
 - Multiple instances of `ct_hammer` can be run in parallel to allow load
   testing to be scaled up arbitrarily.

These testing tools can also be used to confirm that the Log continues to
operate normally while various maintenance activities – software
rollouts, machine turndowns, configuration updates, etc. – are in
progress.


## Backups

For most production systems with persistent data, regular backups are
recommended.  However, the cryptographic nature of a CT Log means that backups
of its data induce a dangerous temptation.

The temptation is this: if you have a backup, at some point you will feel the
urge to perform a **restore** from backup.  If any data has been accepted for
inclusion since that backup (and a signed promise-to-include issued), then
restoring the backup is effectively forking the underlying Merkle tree.  This
breaks the tree's append-only property, which results in the Log being
disqualified.
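
The fork can be illustrated with a small sketch that computes RFC 6962 Merkle
tree hashes directly.  It is not Trillian's implementation: the helper names
are hypothetical and the byte strings are placeholders for real log entries.
It shows that once a tree head has been signed for a given size, a restored
Log that sequences a different entry at the same position produces a different
root for that size.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// leafHash is the RFC 6962 leaf hash: SHA-256(0x00 || leaf).
func leafHash(leaf []byte) []byte {
	h := sha256.Sum256(append([]byte{0x00}, leaf...))
	return h[:]
}

// nodeHash is the RFC 6962 interior node hash: SHA-256(0x01 || left || right).
func nodeHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

// merkleRoot computes the RFC 6962 Merkle Tree Hash over the given leaves.
func merkleRoot(leaves [][]byte) []byte {
	switch n := len(leaves); n {
	case 0:
		h := sha256.Sum256(nil)
		return h[:]
	case 1:
		return leafHash(leaves[0])
	default:
		// Split at the largest power of two strictly less than n.
		k := 1
		for k*2 < n {
			k *= 2
		}
		return nodeHash(merkleRoot(leaves[:k]), merkleRoot(leaves[k:]))
	}
}

// forked reports whether two views of a Log with the same tree size
// disagree about the root hash.
func forked(a, b [][]byte) bool {
	return len(a) == len(b) && !bytes.Equal(merkleRoot(a), merkleRoot(b))
}

func main() {
	// State of the Log when the backup was taken.
	backup := [][]byte{[]byte("cert-a"), []byte("cert-b"), []byte("cert-c")}

	// After the backup, the Log sequenced cert-d and signed a head of size 4.
	signed := append(append([][]byte{}, backup...), []byte("cert-d"))

	// After a restore, cert-d is gone and another entry lands at index 3.
	restored := append(append([][]byte{}, backup...), []byte("cert-e"))

	fmt.Printf("signed root:   %x\n", merkleRoot(signed))
	fmt.Printf("restored root: %x\n", merkleRoot(restored))
	fmt.Println("forked:", forked(signed, restored))
}
```

Anyone holding the originally signed tree head can detect the mismatch, which
is why a restore that discards sequenced entries cannot be hidden.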


## Troubleshooting

All of the Trillian and CTFE binaries use the
[klog](https://github.com/kubernetes/klog) library for logging, so additional
diagnostic information can be obtained by modifying the klog options, for
example, by enabling `--logtostderr -v 1`.

Other useful klog options for debugging specific problems are:

 - `--vmodule`: increase the logging level selectively in particular
   code files.
 - `--log_backtrace_at`: emit a full stack trace at particular logging
   statements.

Also, the underlying storage system can be queried independently, using the
relevant vendor tool:

 - For MySQL, the command line client can be used, in combination with the
   Trillian
   [database schema](https://github.com/google/trillian/blob/master/storage/mysql/storage.sql).
 - For Cloud Spanner, the
   [console](https://cloud.google.com/spanner/docs/quickstart-console#run_a_query)
   can be used, in combination with the
   [database schema](https://github.com/google/trillian/blob/master/storage/cloudspanner/spanner.sdl).

Obviously, this should be done with **extreme** care for a live database!


## Browser Submission

Various browser vendors now require Web PKI certificates to be logged in some
number of accepted CT logs
(e.g. [Chrome](https://github.com/chromium/ct-policy/blob/master/log_policy.md),
[Apple](https://support.apple.com/en-gb/HT205280)).

Each vendor has its own criteria for admission to the set of accepted Logs,
and those criteria are beyond the scope of this document.  However, the set of
information that is likely to be needed for browser acceptance includes:

 - The URL for the Log.
 - The public key for the Log.
 - The maximum merge delay (MMD) that the Log has committed to.
 - Any [temporal shard](#temporal-sharding) ranges.
 - The set of accepted root certificates.
 - The values of any rate limits on external traffic.
