Commit 9edfaed6 authored by Ben Kochie, committed by GitHub

Reduce the cardinality of health endpoint metrics (#4650)

The health endpoint histogram has high cardinality for such a simple
endpoint. Introduce a new "Slim" set of buckets for `/health` to reduce
the metrics load on large deployments, especially those that run
per-node DNS caching services.

Add a metric to count internal health check failures rather than using
the timeout value as a side-effect monitor of check errors. This avoids
incorrectly recording the timeout value when the error is not a timeout
(e.g. connection refused).
Signed-off-by: SuperQ <superq@gmail.com>
parent 4c0fdc39
@@ -50,11 +50,13 @@ Doing this is supported but both endpoints ":8080" and ":8081" will export the e
 If monitoring is enabled (via the *prometheus* plugin) then the following metric is exported:
 * `coredns_health_request_duration_seconds{}` - duration to process a HTTP query to the local
   `/health` endpoint. As this is a local operation it should be fast. A (large) increase in this
   duration indicates the CoreDNS process is having trouble keeping up with its query load.
+* `coredns_health_request_failures_total{}` - The number of times the internal health check loop
+  failed to query `/health`.
-Note that this metric *does not* have a `server` label, because being overloaded is a symptom of
+Note that these metrics *do not* have a `server` label, because being overloaded is a symptom of
 the running process, *not* a specific server.
 ## Examples
@@ -26,7 +26,8 @@ func (h *health) overloaded() {
     start := time.Now()
     resp, err := client.Get(url)
     if err != nil {
-        HealthDuration.Observe(timeout.Seconds())
+        HealthDuration.Observe(time.Since(start).Seconds())
+        HealthFailures.Inc()
         log.Warningf("Local health request to %q failed: %s", url, err)
         continue
     }
@@ -49,7 +50,14 @@ var (
         Namespace: plugin.Namespace,
         Subsystem: "health",
         Name:      "request_duration_seconds",
-        Buckets:   plugin.TimeBuckets,
+        Buckets:   plugin.SlimTimeBuckets,
         Help:      "Histogram of the time (in seconds) each request took.",
     })
+    // HealthFailures is the metric used to count how many times the health request failed
+    HealthFailures = promauto.NewCounter(prometheus.CounterOpts{
+        Namespace: plugin.Namespace,
+        Subsystem: "health",
+        Name:      "request_failures_total",
+        Help:      "The number of times the health check failed.",
+    })
 )
@@ -105,5 +105,8 @@ const Namespace = "coredns"
 // TimeBuckets is based on Prometheus client_golang prometheus.DefBuckets
 var TimeBuckets = prometheus.ExponentialBuckets(0.00025, 2, 16) // from 0.25ms to 8 seconds
+// SlimTimeBuckets is low cardinality set of duration buckets.
+var SlimTimeBuckets = prometheus.ExponentialBuckets(0.00025, 10, 5) // from 0.25ms to 2.5 seconds
 // ErrOnce is returned when a plugin doesn't support multiple setups per server.
 var ErrOnce = errors.New("this plugin can only be used once per Server Block")
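To make the cardinality saving concrete, the sketch below recomputes both bucket sets and counts the time series one unlabeled histogram would expose. It is stdlib-only: `expBuckets` is a hypothetical helper that mirrors what `prometheus.ExponentialBuckets(start, factor, count)` computes, and `seriesPerHistogram` is likewise an illustrative name, not part of CoreDNS or client_golang.

```go
package main

import "fmt"

// expBuckets is a hypothetical re-implementation of exponential bucket
// generation: count upper bounds, each factor times the previous one.
func expBuckets(start, factor float64, count int) []float64 {
	b := make([]float64, count)
	for i := range b {
		b[i] = start
		start *= factor
	}
	return b
}

// seriesPerHistogram counts the series one unlabeled histogram exposes:
// one per explicit bucket, plus the implicit +Inf bucket, _sum and _count.
func seriesPerHistogram(buckets []float64) int {
	return len(buckets) + 3
}

func main() {
	full := expBuckets(0.00025, 2, 16) // same arguments as TimeBuckets
	slim := expBuckets(0.00025, 10, 5) // same arguments as SlimTimeBuckets

	fmt.Println("TimeBuckets series:", seriesPerHistogram(full))
	fmt.Println("SlimTimeBuckets series:", seriesPerHistogram(slim))
	fmt.Printf("slim upper bounds (seconds): %.5g\n", slim)
}
```

Under these assumptions the histogram drops from 19 series to 8 per CoreDNS instance, which adds up on deployments that run one caching instance per node.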