# boulder-observer

A modular, configuration-driven approach to black-box monitoring with
Prometheus.

* [boulder-observer](#boulder-observer)
  * [Usage](#usage)
    * [Options](#options)
    * [Starting the boulder-observer daemon](#starting-the-boulder-observer-daemon)
  * [Configuration](#configuration)
    * [Root](#root)
      * [Schema](#schema)
      * [Example](#example)
    * [Monitors](#monitors)
      * [Schema](#schema-1)
      * [Example](#example-1)
    * [Probers](#probers)
      * [DNS](#dns)
        * [Schema](#schema-2)
        * [Example](#example-2)
      * [HTTP](#http)
        * [Schema](#schema-3)
        * [Example](#example-3)
      * [CRL](#crl)
        * [Schema](#schema-4)
        * [Example](#example-4)
      * [TLS](#tls)
        * [Schema](#schema-5)
        * [Example](#example-5)
  * [Metrics](#metrics)
    * [Global Metrics](#global-metrics)
      * [obs_monitors](#obs_monitors)
      * [obs_observations](#obs_observations)
    * [CRL Metrics](#crl-metrics)
      * [obs_crl_this_update](#obs_crl_this_update)
      * [obs_crl_next_update](#obs_crl_next_update)
      * [obs_crl_revoked_cert_count](#obs_crl_revoked_cert_count)
    * [TLS Metrics](#tls-metrics)
      * [obs_tls_not_after](#obs_tls_not_after)
      * [obs_tls_reason](#obs_tls_reason)
  * [Development](#development)
    * [Starting Prometheus locally](#starting-prometheus-locally)
    * [Viewing metrics locally](#viewing-metrics-locally)

## Usage

### Options

```shell
$ ./boulder-observer -help
  -config string
        Path to boulder-observer configuration file (default "config.yml")
```

### Starting the boulder-observer daemon

```shell
$ ./boulder-observer -config test/config-next/observer.yml
I152525 boulder-observer _KzylQI Versions: main=(Unspecified Unspecified) Golang=(go1.16.2) BuildHost=(Unspecified)
I152525 boulder-observer q_D84gk Initializing boulder-observer daemon from config: test/config-next/observer.yml
I152525 boulder-observer 7aq68AQ all monitors passed validation
I152527 boulder-observer yaefiAw kind=[HTTP] success=[true] duration=[0.130097] name=[https://letsencrypt.org-[200]]
I152527 boulder-observer 65CuDAA kind=[HTTP] success=[true] duration=[0.148633] name=[http://letsencrypt.org/foo-[200 404]]
I152530 boulder-observer idi4rwE kind=[DNS] success=[false] duration=[0.000093] name=[[2606:4700:4700::1111]:53-udp-A-google.com-recurse]
I152530 boulder-observer prOnrw8 kind=[DNS] success=[false] duration=[0.000242] name=[[2606:4700:4700::1111]:53-tcp-A-google.com-recurse]
I152530 boulder-observer 6uXugQw kind=[DNS] success=[true] duration=[0.022962] name=[1.1.1.1:53-udp-A-google.com-recurse]
I152530 boulder-observer to7h-wo kind=[DNS] success=[true] duration=[0.029860] name=[owen.ns.cloudflare.com:53-udp-A-letsencrypt.org-no-recurse]
I152530 boulder-observer ovDorAY kind=[DNS] success=[true] duration=[0.033820] name=[owen.ns.cloudflare.com:53-tcp-A-letsencrypt.org-no-recurse]
...
```

## Configuration

Configuration is provided via a YAML file.

### Root

#### Schema

`debugaddr`: The Prometheus scrape port prefixed with a single colon
(e.g. `:8040`).

`buckets`: List of floats representing Prometheus histogram buckets (e.g.
`[.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]`).

`syslog`: Map of log levels, see schema below.

- `stdoutlevel`: Log level for stdout, see legend below.
- `sysloglevel`: Log level for syslog, see legend below.

`0`: *EMERG* `1`: *ALERT* `2`: *CRIT* `3`: *ERR* `4`: *WARN* `5`:
*NOTICE* `6`: *INFO* `7`: *DEBUG*

`monitors`: List of monitors, see [monitors](#monitors) for schema.

#### Example

```yaml
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6
monitors:
  -
    ...
```

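To show how `debugaddr` ties into Prometheus, a scrape job for the example above might look like the following sketch. The job name `boulder-observer`, the scrape interval, and the target `localhost:8040` are assumptions (observer running on the same host as Prometheus with `debugaddr: :8040`), and it assumes metrics are served at Prometheus's default `/metrics` path.

```yaml
scrape_configs:
  - job_name: boulder-observer        # hypothetical job name
    scrape_interval: 15s              # assumed value; adjust as needed
    static_configs:
      - targets: ["localhost:8040"]   # assumes debugaddr is :8040 on the local host
```
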
### Monitors

#### Schema

`period`: Interval between probing attempts (e.g. `1s`, `1m`, `1h`).

`kind`: Kind of prober to use, see [probers](#probers) for schema.

`settings`: Map of prober settings, see [probers](#probers) for schema.

#### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      ...
```

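Because `monitors` is a list, several monitors with different periods and kinds can sit side by side. The sketch below is an illustration only: it is reconstructed from two of the monitor names visible in the daemon log output above, not copied from a shipped configuration, and it relies on the `useragent` default by omitting that setting.

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: udp
      server: 1.1.1.1:53
      recurse: true
      query_name: google.com
      query_type: A
  -
    period: 2s
    kind: HTTP
    settings:
      url: https://letsencrypt.org
      rcodes: [200]
```
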
### Probers

#### DNS

##### Schema

`protocol`: Protocol to use, options are: `udp` or `tcp`.

`server`: Hostname, IPv4 address, or IPv6 address surrounded with
brackets, followed by the port, of the DNS server to send the query to
(e.g. `example.com:53`, `1.1.1.1:53`, or `[2606:4700:4700::1111]:53`).

`recurse`: Bool indicating if recursive resolution is desired.

`query_name`: Name to query (e.g. `example.com`).

`query_type`: Record type to query, options are: `A`, `AAAA`, `TXT`, or
`CAA`.

##### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: tcp
      # Quoted so the leading bracket is not parsed as a YAML flow sequence.
      server: "[2606:4700:4700::1111]:53"
      recurse: false
      query_name: letsencrypt.org
      query_type: A
```

#### HTTP

##### Schema

`url`: Scheme + Hostname to send a request to (e.g.
`https://example.com`).

`rcodes`: List of expected HTTP response codes.

`useragent`: String to set HTTP header User-Agent. If no useragent string
is provided it will default to `letsencrypt/boulder-observer-http-client`.

##### Example

```yaml
monitors:
  -
    period: 2s
    kind: HTTP
    settings:
      url: http://letsencrypt.org/FOO
      rcodes: [200, 404]
      useragent: letsencrypt/boulder-observer-http-client
```

#### CRL

##### Schema

`url`: Scheme + Hostname to grab the CRL from (e.g. `http://x1.c.lencr.org/`).

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
```

#### TLS

##### Schema

`hostname`: Hostname to run TLS check on (e.g. `valid-isrgrootx1.letsencrypt.org`).

`rootOrg`: Organization to check against the root certificate Organization (e.g. `Internet Security Research Group`).

`rootCN`: Name to check against the root certificate Common Name (e.g. `ISRG Root X1`). If not provided, root comparison will be skipped.

`response`: Expected site response; must be one of: `valid`, `revoked` or `expired`.

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      hostname: valid-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: valid
```

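The `response` setting also lets a monitor assert that a site is correctly reported as revoked or expired. As a sketch, assuming the test site `revoked-isrgrootx1.letsencrypt.org` (a hostname not mentioned elsewhere in this document) serves a revoked certificate chaining to ISRG Root X1:

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      hostname: revoked-isrgrootx1.letsencrypt.org  # assumed test hostname
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: revoked  # the probe succeeds only if the site is seen as revoked
```
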
## Metrics

Observer provides the following metrics.

### Global Metrics

These metrics will always be available.

#### obs_monitors

Count of configured monitors.

**Labels:**

`kind`: Kind of Prober the monitor is configured to use.

`valid`: Bool indicating whether settings provided could be validated
for the `kind` of Prober specified.

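One possible use of `obs_monitors` is an alert on monitors that failed validation. This is a sketch only: the alert name and severity are hypothetical, and it assumes the `valid` label is rendered as the string `"true"` or `"false"`, in the same way the `success` label is matched in the TLS rule later in this document.

```yaml
- alert: ObserverInvalidMonitor   # hypothetical alert name
  expr: obs_monitors{valid="false"} > 0
  labels:
    severity: warning
  annotations:
    description: 'At least one {{ $labels.kind }} monitor failed validation'
```
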
#### obs_observations

**Labels:**

`name`: Name of the monitor.

`kind`: Kind of prober the monitor is configured to use.

`duration`: Duration of the probing in seconds.

`success`: Bool indicating whether the result of the probe attempt was
successful.

**Bucketed response times:**

This is configurable, see `buckets` under [root/schema](#schema).

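Since observations are bucketed, probe latency percentiles can be estimated with `histogram_quantile`. The rule below is a sketch under assumptions: it assumes the metric is exported as a standard Prometheus histogram (so an `obs_observations_bucket` series with an `le` label exists), and the alert name, the 95th percentile, the 1 second threshold, and the `for` duration are all illustrative choices.

```yaml
- alert: ObserverProbeSlow   # hypothetical alert name
  # Estimated p95 probe duration per monitor over the last 5 minutes.
  expr: histogram_quantile(0.95, sum by (le, name) (rate(obs_observations_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    description: '95th percentile probe duration for {{ $labels.name }} is above 1 second'
```

The threshold only makes sense if it falls inside the configured `buckets` range; with the example buckets in this document, the largest bucket boundary is 10 seconds.
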
### CRL Metrics

These metrics will be available whenever a valid CRL prober is configured.

#### obs_crl_this_update

Unix timestamp value (in seconds) of the thisUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a thisUpdate timestamp in the future, signalling that something may have gone wrong during its creation:

```yaml
- alert: CRLThisUpdateInFuture
  expr: obs_crl_this_update{url="http://x1.c.lencr.org/"} > time()
  labels:
    severity: critical
  annotations:
    description: 'CRL thisUpdate is in the future'
```

#### obs_crl_next_update

Unix timestamp value (in seconds) of the nextUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a nextUpdate timestamp in the past, signalling that the CRL was not updated on time:

```yaml
- alert: CRLNextUpdateInPast
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} < time()
  labels:
    severity: critical
  annotations:
    description: 'CRL nextUpdate is in the past'
```

Another potentially useful rule would be to notify when nextUpdate is within X days from the current time, as a reminder that the update is coming up soon.

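A minimal sketch of such a reminder rule, assuming X is 2 days (172800 seconds); the alert name, the severity, and the choice of 2 days are hypothetical:

```yaml
- alert: CRLNextUpdateSoon   # hypothetical alert name
  # Seconds remaining until nextUpdate, compared against 2 days.
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} - time() < 2 * 86400
  labels:
    severity: warning
  annotations:
    description: 'CRL nextUpdate is less than 2 days away'
```

Note that this expression also matches once nextUpdate is already in the past, because the remaining time is then negative.
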
#### obs_crl_revoked_cert_count

Count of revoked certificates in a CRL.

**Labels:**

`url`: URL of the CRL.

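As an illustration, a rule could flag an unusually large change in the number of revoked certificates. This sketch assumes the metric behaves as a gauge (hence `delta` rather than `rate`); the alert name, the one-day window, and the threshold of 1000 are hypothetical values to be tuned per CRL.

```yaml
- alert: CRLRevokedCountJump   # hypothetical alert name
  # Change in the revoked certificate count over the last day.
  expr: delta(obs_crl_revoked_cert_count{url="http://x1.c.lencr.org/"}[1d]) > 1000
  labels:
    severity: warning
  annotations:
    description: 'Revoked certificate count for {{ $labels.url }} grew by more than 1000 in a day'
```
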
### TLS Metrics

These metrics will be available whenever a valid TLS prober is configured.

#### obs_tls_not_after

Unix timestamp value (in seconds) of the notAfter field for a subscriber certificate.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

**Example Usage:**

This is a sample rule that alerts when a site has a notAfter timestamp indicating that the certificate will expire within the next 20 days:

```yaml
  - alert: CertExpiresSoonWarning
    annotations:
      description: "The certificate at {{ $labels.hostname }} expires within 20 days, on: {{ $value | humanizeTimestamp }}"
    expr: (obs_tls_not_after{hostname=~"^[^e][a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}) <= time() + 1728000
    for: 60m
    labels:
      severity: warning
```

#### obs_tls_reason

This is a count that increments by one for each resulting reason of a TLS check. The reason is `nil` if the TLS Prober returns `true` and one of the following otherwise: `internalError`, `ocspError`, `rootDidNotMatch`, `responseDidNotMatch`.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

`reason`: The reason for the TLS Prober returning false, or `nil` if it returns true.

**Example Usage:**

This is a sample rule that alerts when the TLS Prober returns false, providing insight into the reason for failure.

```yaml
  - alert: TLSCertCheckFailed
    annotations:
      description: "The TLS probe for {{ $labels.hostname }} failed for reason: {{ $labels.reason }}. This potentially violates CP 2.2."
    expr: (rate(obs_observations_count{success="false",name=~"[a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
```

## Development

### Starting Prometheus locally

Please note, this assumes you've installed a local Prometheus binary.

```shell
prometheus --config.file=boulder/test/prometheus/prometheus.yml
```

### Viewing metrics locally

When developing with a local Prometheus instance you can use this link
to view metrics: [link](http://0.0.0.0:9090)