# boulder-observer

A modular, configuration-driven approach to black box monitoring with
Prometheus.

* [boulder-observer](#boulder-observer)
  * [Usage](#usage)
    * [Options](#options)
    * [Starting the boulder-observer
      daemon](#starting-the-boulder-observer-daemon)
  * [Configuration](#configuration)
    * [Root](#root)
      * [Schema](#schema)
      * [Example](#example)
    * [Monitors](#monitors)
      * [Schema](#schema-1)
      * [Example](#example-1)
    * [Probers](#probers)
      * [DNS](#dns)
        * [Schema](#schema-2)
        * [Example](#example-2)
      * [HTTP](#http)
        * [Schema](#schema-3)
        * [Example](#example-3)
      * [CRL](#crl)
        * [Schema](#schema-4)
        * [Example](#example-4)
      * [TLS](#tls)
        * [Schema](#schema-5)
        * [Example](#example-5)
  * [Metrics](#metrics)
    * [Global Metrics](#global-metrics)
      * [obs_monitors](#obs_monitors)
      * [obs_observations](#obs_observations)
    * [CRL Metrics](#crl-metrics)
      * [obs_crl_this_update](#obs_crl_this_update)
      * [obs_crl_next_update](#obs_crl_next_update)
      * [obs_crl_revoked_cert_count](#obs_crl_revoked_cert_count)
    * [TLS Metrics](#tls-metrics)
      * [obs_tls_not_after](#obs_tls_not_after)
      * [obs_tls_reason](#obs_tls_reason)
  * [Development](#development)
    * [Starting Prometheus locally](#starting-prometheus-locally)
    * [Viewing metrics locally](#viewing-metrics-locally)

## Usage

### Options

```shell
$ ./boulder-observer -help
  -config string
        Path to boulder-observer configuration file (default "config.yml")
```

### Starting the boulder-observer daemon

```shell
$ ./boulder-observer -config test/config-next/observer.yml
I152525 boulder-observer _KzylQI Versions: main=(Unspecified Unspecified) Golang=(go1.16.2) BuildHost=(Unspecified)
I152525 boulder-observer q_D84gk Initializing boulder-observer daemon from config: test/config-next/observer.yml
I152525 boulder-observer 7aq68AQ all monitors passed validation
I152527 boulder-observer yaefiAw kind=[HTTP] success=[true] duration=[0.130097] name=[https://letsencrypt.org-[200]]
I152527 boulder-observer 65CuDAA kind=[HTTP] success=[true] duration=[0.148633] name=[http://letsencrypt.org/foo-[200 404]]
I152530 boulder-observer idi4rwE kind=[DNS] success=[false] duration=[0.000093] name=[[2606:4700:4700::1111]:53-udp-A-google.com-recurse]
I152530 boulder-observer prOnrw8 kind=[DNS] success=[false] duration=[0.000242] name=[[2606:4700:4700::1111]:53-tcp-A-google.com-recurse]
I152530 boulder-observer 6uXugQw kind=[DNS] success=[true] duration=[0.022962] name=[1.1.1.1:53-udp-A-google.com-recurse]
I152530 boulder-observer to7h-wo kind=[DNS] success=[true] duration=[0.029860] name=[owen.ns.cloudflare.com:53-udp-A-letsencrypt.org-no-recurse]
I152530 boulder-observer ovDorAY kind=[DNS] success=[true] duration=[0.033820] name=[owen.ns.cloudflare.com:53-tcp-A-letsencrypt.org-no-recurse]
...
```

## Configuration

Configuration is provided via a YAML file.

### Root

#### Schema

`debugaddr`: The Prometheus scrape port prefixed with a single colon
(e.g. `:8040`).

`buckets`: List of floats representing Prometheus histogram buckets (e.g.
`[.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]`).

`syslog`: Map of log levels, see schema below.

- `stdoutlevel`: Log level for stdout, see legend below.
- `sysloglevel`: Log level for syslog, see legend below.

`0`: *EMERG* `1`: *ALERT* `2`: *CRIT* `3`: *ERR* `4`: *WARN* `5`:
*NOTICE* `6`: *INFO* `7`: *DEBUG*

`monitors`: List of monitors, see [monitors](#monitors) for schema.

#### Example

```yaml
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6
monitors:
  -
    ...
```

### Monitors

#### Schema

`period`: Interval between probing attempts (e.g. `1s` `1m` `1h`).

`kind`: Kind of prober to use, see [probers](#probers) for schema.

`settings`: Map of prober settings, see [probers](#probers) for schema.

#### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
        ...
```
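
To show how the root keys and the `monitors` list fit together in a single file, here is a minimal sketch of a complete configuration. The periods are illustrative assumptions, and the `settings` fields are the ones described for each prober under [probers](#probers); treat it as a starting point rather than a recommended setup.

```yaml
# Sketch of a complete configuration file; periods are illustrative.
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6
monitors:
  -
    # Expect a 200 from the site every 10 seconds.
    period: 10s
    kind: HTTP
    settings:
      url: https://letsencrypt.org
      rcodes: [200]
  -
    # Fetch the CRL once an hour.
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
```

Each list entry is an independent monitor with its own `period`, so probers of different kinds can be mixed freely in one file.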

### Probers

#### DNS

##### Schema

`protocol`: Protocol to use, options are: `udp` or `tcp`.

`server`: Hostname, IPv4 address, or bracketed IPv6 address, followed by
the port, of the DNS server to send the query to (e.g.
`example.com:53`, `1.1.1.1:53`, or `[2606:4700:4700::1111]:53`).

`recurse`: Bool indicating if recursive resolution is desired.

`query_name`: Name to query (e.g. `example.com`).

`query_type`: Record type to query, options are: `A`, `AAAA`, `TXT`, or
`CAA`.

##### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: tcp
      # Quoted so YAML does not parse the leading bracket as a flow sequence.
      server: "[2606:4700:4700::1111]:53"
      recurse: false
      query_name: letsencrypt.org
      query_type: A
```
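
For comparison, a recursive query over UDP against a public resolver might be configured as in the sketch below. The values mirror the `1.1.1.1:53-udp-A-google.com-recurse` monitor name shown in the daemon log output earlier; the `period` is an assumption.

```yaml
monitors:
  -
    period: 10s          # illustrative interval
    kind: DNS
    settings:
      protocol: udp
      server: 1.1.1.1:53
      recurse: true      # ask the resolver to recurse on our behalf
      query_name: google.com
      query_type: A
```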

#### HTTP

##### Schema

`url`: Scheme + Hostname to send a request to (e.g.
`https://example.com`).

`rcodes`: List of expected HTTP response codes.

`useragent`: String to set HTTP header User-Agent. If no useragent string
is provided it will default to `letsencrypt/boulder-observer-http-client`.

##### Example

```yaml
monitors:
  -
    period: 2s
    kind: HTTP
    settings:
      url: http://letsencrypt.org/FOO
      rcodes: [200, 404]
      useragent: letsencrypt/boulder-observer-http-client
```

#### CRL

##### Schema

`url`: Scheme + Hostname to grab the CRL from (e.g. `http://x1.c.lencr.org/`).

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
```

#### TLS

##### Schema

`hostname`: Hostname to run TLS check on (e.g. `valid-isrgrootx1.letsencrypt.org`).

`rootOrg`: Organization to check against the root certificate Organization (e.g. `Internet Security Research Group`).

`rootCN`: Name to check against the root certificate Common Name (e.g. `ISRG Root X1`). If not provided, root comparison will be skipped.

`response`: Expected site response; must be one of: `valid`, `revoked` or `expired`.

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      hostname: valid-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: valid
```
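
The `response` setting also accepts `revoked` and `expired`, which lets a monitor assert that a deliberately bad certificate is still being served as such. The sketch below assumes test sites named `revoked-isrgrootx1.letsencrypt.org` and `expired-isrgrootx1.letsencrypt.org`, by analogy with the `valid-` hostname above; substitute whatever hosts apply in your environment.

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      # Hostname assumed by analogy with the valid- example.
      hostname: revoked-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: revoked
  -
    period: 1h
    kind: TLS
    settings:
      # Hostname assumed by analogy with the valid- example.
      hostname: expired-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: expired
```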

## Metrics

Observer provides the following metrics.

### Global Metrics

These metrics will always be available.

#### obs_monitors

Count of configured monitors.

**Labels:**

`kind`: Kind of Prober the monitor is configured to use.

`valid`: Bool indicating whether settings provided could be validated
for the `kind` of Prober specified.
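
**Example Usage:**

As a sketch, a rule along these lines could flag monitors whose settings failed validation. The alert name and severity are hypothetical, and the rule assumes the `valid` label is rendered as the string `"false"`.

```yaml
- alert: ObserverMonitorInvalid
  # Assumes the `valid` label takes the string values "true" and "false".
  expr: obs_monitors{valid="false"} > 0
  labels:
    severity: warning
  annotations:
    description: 'One or more {{ $labels.kind }} monitors failed validation'
```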

#### obs_observations

**Labels:**

`name`: Name of the monitor.

`kind`: Kind of prober the monitor is configured to use.

`duration`: Duration of the probing in seconds.

`success`: Bool indicating whether the result of the probe attempt was
successful.

**Bucketed response times:**

This is configurable, see `buckets` under [root/schema](#schema).
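
**Example Usage:**

Because the observations are bucketed, latency percentiles can be estimated with `histogram_quantile`. The rule below is a sketch only: it assumes the buckets are exposed under the conventional Prometheus series name `obs_observations_bucket`, and the alert name, the 95th percentile, and the 1 second threshold are all illustrative choices.

```yaml
- alert: ObserverProbeSlow
  # Assumes the histogram buckets are exposed as `obs_observations_bucket`.
  expr: histogram_quantile(0.95, sum by (le, name, kind) (rate(obs_observations_bucket{success="true"}[5m]))) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    description: '95th percentile probe duration for {{ $labels.name }} is above 1s'
```

Keeping `le` in the `sum by` clause is required for `histogram_quantile` to work; `name` and `kind` are kept so the alert identifies the slow monitor.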

### CRL Metrics

These metrics will be available whenever a valid CRL prober is configured.

#### obs_crl_this_update

Unix timestamp value (in seconds) of the thisUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a thisUpdate timestamp in the future, signalling that something may have gone wrong during its creation:

```yaml
- alert: CRLThisUpdateInFuture
  expr: obs_crl_this_update{url="http://x1.c.lencr.org/"} > time()
  labels:
    severity: critical
  annotations:
    description: 'CRL thisUpdate is in the future'
```

#### obs_crl_next_update

Unix timestamp value (in seconds) of the nextUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a nextUpdate timestamp in the past, signalling that the CRL was not updated on time:

```yaml
- alert: CRLNextUpdateInPast
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} < time()
  labels:
    severity: critical
  annotations:
    description: 'CRL nextUpdate is in the past'
```

Another potentially useful rule would be to notify when nextUpdate is within X days of the current time, as a reminder that the update is coming up soon.
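
A sketch of such a reminder rule, with X arbitrarily set to 3 days (3 * 86400 seconds); the alert name and severity are hypothetical:

```yaml
- alert: CRLNextUpdateSoon
  # The 3-day window is an illustrative value; adjust to your issuance cadence.
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} - time() < 3 * 86400
  labels:
    severity: warning
  annotations:
    description: 'CRL nextUpdate is less than 3 days away'
```

Note that this expression stays true once nextUpdate has passed, so it would fire alongside a rule that checks for nextUpdate in the past.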

#### obs_crl_revoked_cert_count

Count of revoked certificates in a CRL.

**Labels:**

`url`: URL of the CRL.

### TLS Metrics

These metrics will be available whenever a valid TLS prober is configured.

#### obs_tls_not_after

Unix timestamp value (in seconds) of the notAfter field for a subscriber certificate.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

**Example Usage:**

This is a sample rule that alerts when a site has a notAfter timestamp indicating that the certificate will expire within the next 20 days:

```yaml
  - alert: CertExpiresSoonWarning
    annotations:
      description: "The certificate at {{ $labels.hostname }} expires within 20 days, on: {{ $value | humanizeTimestamp }}"
    expr: (obs_tls_not_after{hostname=~"^[^e][a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}) <= time() + 1728000
    for: 60m
    labels:
      severity: warning
```

#### obs_tls_reason

This is a count that increments by one for each resulting reason of a TLS check. The reason is `nil` if the TLS Prober returns `true` and one of the following otherwise: `internalError`, `ocspError`, `rootDidNotMatch`, `responseDidNotMatch`.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

`reason`: The reason for the TLS Prober returning false, or `nil` if it returns true.

**Example Usage:**

This is a sample rule that alerts when the TLS Prober returns false, providing insight into the reason for failure.

```yaml
  - alert: TLSCertCheckFailed
    annotations:
      description: "The TLS probe for {{ $labels.hostname }} failed for reason: {{ $labels.reason }}. This potentially violates CP 2.2."
    expr: (rate(obs_observations_count{success="false",name=~"[a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
```
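
The sample above is driven by `obs_observations_count`. As an alternative sketch, a rule can query `obs_tls_reason` directly so that the `hostname` and `reason` labels are carried on the alerting series itself. This assumes the counter is exposed under exactly that name and that successful probes use the label value `nil`, as described above; the alert name and the 10 minute window are hypothetical.

```yaml
  - alert: TLSProbeFailureReason
    annotations:
      description: "The TLS probe for {{ $labels.hostname }} failed for reason: {{ $labels.reason }}."
    # Assumes successful probes are recorded with reason="nil".
    expr: increase(obs_tls_reason{reason!="nil"}[10m]) > 0
    labels:
      severity: critical
```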

## Development

### Starting Prometheus locally

Please note, this assumes you've installed a local Prometheus binary.

```shell
prometheus --config.file=boulder/test/prometheus/prometheus.yml
```
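
For orientation, a minimal Prometheus configuration that scrapes a locally running observer might look like the sketch below. It is not the contents of the file referenced above; the job name and scrape interval are assumptions, and the target port assumes `debugaddr: :8040` as in the root example.

```yaml
# Sketch of a prometheus.yml for local development.
global:
  scrape_interval: 15s              # illustrative interval
scrape_configs:
  - job_name: boulder-observer      # hypothetical job name
    static_configs:
      - targets: ["localhost:8040"] # matches `debugaddr: :8040`
```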

### Viewing metrics locally

When developing with a local Prometheus instance you can use this link
to view metrics: [link](http://0.0.0.0:9090)
