Thanks for the great article!

Jun 26, 2019

A few remarks:

An aggregate function is missing here; the by (pod) modifier is only valid on an aggregation operator such as sum():

rate (
  prometheus_tsdb_head_samples_appended_total[5m]
) by (pod)

It should probably be changed to:

sum (
  rate (
    prometheus_tsdb_head_samples_appended_total[5m]
  )
) by (pod)
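
As a rough illustration of what the corrected query does, the Python sketch below groups per-series rates by the pod label and sums them. The label names other than pod and all numeric values are made up for the example; real series will differ.

```python
from collections import defaultdict

# Hypothetical output of rate(...[5m]): one value per series,
# each identified by its label set. Values are invented.
series_rates = [
    ({"pod": "prometheus-0", "type": "a"}, 100.0),
    ({"pod": "prometheus-0", "type": "b"}, 50.0),
    ({"pod": "prometheus-1", "type": "a"}, 200.0),
]


def sum_by(series, label):
    """Mimic `sum(...) by (label)`: add up values sharing the same label value."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


per_pod = sum_by(series_rates, "pod")
print(per_pod)  # {'prometheus-0': 150.0, 'prometheus-1': 200.0}
```

Without the aggregation step there is nothing to group, which is why rate() alone cannot take a by (pod) clause.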

The following recording rule hides Prometheus instances with low uptime:

- record: kube_prometheus:up:max_avg_over_time5m
  expr: >
    max (
      avg_over_time (
        up{job="kube-prom"}[5m]
      )
    )

Suppose you have two Prometheus instances. The first is always unavailable, i.e. it has up=0 for long periods of time, while the second is always available, i.e. up=1 all the time. Then max() in the rule above always returns 1, masking the unavailable Prometheus. I’d recommend changing the rule to:

- record: kube_prometheus:up:avg_over_time5m
  expr: >
    min (
      avg_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)

This way the average availability over the last 5 minutes is recorded for each Prometheus pod, so it can later be analyzed for each Prometheus separately. For summary availability across all Prometheus instances, I’d recommend min(kube_prometheus:up:avg_over_time5m), because avg() or max() would mask unavailable pods.
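
To make the masking effect concrete, here is a small Python sketch of the two-instance scenario described above. The pod names and availability values are hypothetical.

```python
# Hypothetical per-pod availability, i.e. avg_over_time(up[5m]) for each pod.
availability = {
    "prometheus-0": 0.0,  # always down
    "prometheus-1": 1.0,  # always up
}


def summarize(values):
    """Return the three candidate summaries across all pods."""
    values = list(values)
    return {
        "max": max(values),
        "avg": sum(values) / len(values),
        "min": min(values),
    }


summary = summarize(availability.values())
print(summary)  # {'max': 1.0, 'avg': 0.5, 'min': 0.0}
```

Only min reports the pod that is completely down; max hides it entirely and avg makes the outage look like partial degradation.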

The following recording rule calculates an average of averages. This breaks the stats, as Sam Mingo already mentioned:

- record: kube_prometheus:up:max_avg_over_time4w
  expr: >
    avg_over_time (
      kube_prometheus:up:max_avg_over_time5m[4w]
    )

There are two possible fixes:

1. To calculate the average uptime over raw data for the last 4 weeks:

- record: kube_prometheus:up:avg_over_time4w
  expr: >
    min (
      avg_over_time (
        up{job="kube-prom"}[4w]
      )
    ) by (pod)

2. To calculate two metrics, the number of successful scrapes and the total number of scrapes over 5 minutes:

- record: kube_prometheus:up:sum_over_time5m
  expr: >
    sum (
      sum_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)
- record: kube_prometheus:up:count_over_time5m
  expr: >
    sum (
      count_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)

And then sum them over 4 weeks:

- record: kube_prometheus:up:sum_over_time4w
  expr: >
    sum (
      sum_over_time (
        kube_prometheus:up:sum_over_time5m[4w]
      )
    ) by (pod)
- record: kube_prometheus:up:count_over_time4w
  expr: >
    sum (
      sum_over_time (
        kube_prometheus:up:count_over_time5m[4w]
      )
    ) by (pod)

Then the average uptime over 4 weeks may be calculated as:

kube_prometheus:up:sum_over_time4w
/
kube_prometheus:up:count_over_time4w
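
The difference between an average of averages and the sum/count ratio shows up whenever windows hold different numbers of samples. The Python sketch below uses two made-up 5-minute windows for a single pod; the sample values are assumptions chosen only to make the arithmetic visible.

```python
# Hypothetical `up` samples for one pod in two 5-minute windows.
# The second window has fewer samples, e.g. because some scrapes never happened.
windows = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # 10 samples, all successful
    [0, 0],                          # 2 samples, both failed
]


def mean(xs):
    return sum(xs) / len(xs)


# Average of per-window averages: every window gets equal weight.
avg_of_avgs = mean([mean(w) for w in windows])

# Sum of successes divided by sum of sample counts: every sample gets equal weight.
total_up = sum(sum(w) for w in windows)
total_count = sum(len(w) for w in windows)
true_avg = total_up / total_count

print(avg_of_avgs)            # 0.5
print(total_up, total_count)  # 10 12
print(round(true_avg, 4))     # 0.8333
```

The sum/count ratio matches the average over the raw samples, while the average of averages overweights the sparse window.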

The following recording rule masks low availability of a single service (say, Thanos) if other services have high uptime:

- record: prometheus:slo:avg_over_time4w
  expr: >
    (
      kube_prometheus:up:max_avg_over_time4w +
      prox_prometheus:up:max_avg_over_time4w +
      thanos_components:up:max_avg_over_time4w
    ) / 3

It would be better to determine the minimum availability across the given services with the following expression:

min (
{__name__=~"kube_prometheus:up:max_avg_over_time4w|prox_prometheus:up:max_avg_over_time4w|thanos_components:up:max_avg_over_time4w"}
)

Written by Aliaksandr Valialkin
