Aliaksandr Valialkin
2 min read · Jun 26, 2019

Thanks for the great article!

A few remarks:

  • An aggregate function is missing here:
rate (
  prometheus_tsdb_head_samples_appended_total[5m]
) by (pod)

The by (pod) modifier is only valid on an aggregation operator, so it should probably be:

sum (
  rate (
    prometheus_tsdb_head_samples_appended_total[5m]
  )
) by (pod)
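
To make the effect of the outer aggregation concrete, here is a minimal Python sketch of what `sum (...) by (pod)` does to the per-series rates. The rate values and the `instance` label are hypothetical, chosen only for illustration; the `sum_by` helper is not part of any Prometheus API.

```python
from collections import defaultdict

# Hypothetical per-series results of rate(...[5m]), in samples per second.
# The "instance" label and all values are made up for illustration.
rates = [
    ({"pod": "prometheus-0", "instance": "10.0.0.1:9090"}, 1200.0),
    ({"pod": "prometheus-0", "instance": "10.0.0.2:9090"}, 300.0),
    ({"pod": "prometheus-1", "instance": "10.0.0.3:9090"}, 900.0),
]


def sum_by(series, label):
    """Mimic `sum (...) by (label)`: group series by one label and add values."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


per_pod = sum_by(rates, "pod")
print(per_pod)
```

Every label other than `pod` is dropped from the result, which is why the grouping has to be attached to an aggregation such as `sum` rather than to `rate` itself.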
  • The following recording rule hides Prometheus instances with low uptime:
- record: kube_prometheus:up:max_avg_over_time5m
  expr: >
    max (
      avg_over_time (
        up{job="kube-prom"}[5m]
      )
    )

Suppose you have two Prometheus instances. The first is always unavailable, i.e. it has up=0 for long periods of time, while the second is always available, i.e. up=1 all the time. Then max() in the rule above always returns 1, masking the unavailable Prometheus. I’d recommend changing the rule to:

- record: kube_prometheus:up:avg_over_time5m
  expr: >
    min (
      avg_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)

This way the average availability over the last 5 minutes is recorded for each Prometheus pod, so it can later be analyzed for each Prometheus separately. For summary availability across all Prometheus instances, I’d recommend min(kube_prometheus:up:avg_over_time5m), because avg() or max() would hide unavailable pods.
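
The two-instance scenario above can be checked with a few lines of Python. This is only a sketch: the pod names are hypothetical, and the values stand in for what avg_over_time(up[5m]) would return for each pod.

```python
# Hypothetical per-pod results of avg_over_time(up{job="kube-prom"}[5m]):
# one pod is down the whole time, the other is up the whole time.
avg_up = {"prometheus-0": 0.0, "prometheus-1": 1.0}

values = list(avg_up.values())
summary = {
    "max": max(values),                # what the original rule records
    "avg": sum(values) / len(values),  # softens the outage
    "min": min(values),                # exposes the worst pod
}
print(summary)
```

Only the minimum reflects that one instance was completely unavailable; the maximum reports full availability and the average reports a misleading half.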

  • The following recording rule calculates an average of averages, which skews the stats, as Sam Mingo already mentioned:
- record: kube_prometheus:up:max_avg_over_time4w
  expr: >
    avg_over_time (
      kube_prometheus:up:max_avg_over_time5m[4w]
    )

There are two possible fixes:

  1. Calculate the average uptime over raw data for the last 4 weeks:
- record: kube_prometheus:up:avg_over_time4w
  expr: >
    min (
      avg_over_time (
        up{job="kube-prom"}[4w]
      )
    ) by (pod)

  2. Calculate two metrics, the number of successful scrapes and the total number of scrapes over 5 minutes:

- record: kube_prometheus:up:sum_over_time5m
  expr: >
    sum (
      sum_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)
- record: kube_prometheus:up:count_over_time5m
  expr: >
    sum (
      count_over_time (
        up{job="kube-prom"}[5m]
      )
    ) by (pod)

And then sum them over 4 weeks:

- record: kube_prometheus:up:sum_over_time4w
  expr: >
    sum (
      sum_over_time (
        kube_prometheus:up:sum_over_time5m[4w]
      )
    ) by (pod)
- record: kube_prometheus:up:count_over_time4w
  expr: >
    sum (
      sum_over_time (
        kube_prometheus:up:count_over_time5m[4w]
      )
    ) by (pod)

Then the average uptime over 4 weeks may be calculated as:

kube_prometheus:up:sum_over_time4w
/
kube_prometheus:up:count_over_time4w
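
The difference between an average of averages and a ratio of sums shows up as soon as the windows hold different numbers of samples. The Python sketch below assumes two hypothetical 5-minute windows, one with 10 scrapes and one with only 2 (for example because scrapes were missed); the numbers are invented for illustration.

```python
# Hypothetical `up` samples in two 5-minute windows with uneven sample counts.
windows = [
    [1] * 10,  # 10 successful scrapes
    [0] * 2,   # only 2 scrapes recorded, both failed
]

# Average of per-window averages: what avg_over_time over a 5m average computes.
window_avgs = [sum(w) / len(w) for w in windows]
avg_of_avgs = sum(window_avgs) / len(window_avgs)

# Ratio of sums: mirrors sum_over_time / count_over_time accumulated per window.
total_sum = sum(sum(w) for w in windows)
total_count = sum(len(w) for w in windows)
ratio_of_sums = total_sum / total_count

# Ground truth: the average over all raw samples.
raw = [s for w in windows for s in w]
raw_avg = sum(raw) / len(raw)

print(avg_of_avgs, ratio_of_sums, raw_avg)
```

The ratio of sums matches the raw average exactly, while the average of averages gives both windows equal weight regardless of how many samples each contains.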
  • The following recording rule masks low availability of a single service (say, Thanos) if the other services have high uptime:
- record: prometheus:slo:avg_over_time4w
  expr: >
    (
      kube_prometheus:up:max_avg_over_time4w +
      prox_prometheus:up:max_avg_over_time4w +
      thanos_components:up:max_avg_over_time4w
    ) / 3

It would be better to determine the minimum availability over the given services with the following expression:

min (
  {__name__=~"kube_prometheus:up:max_avg_over_time4w|prox_prometheus:up:max_avg_over_time4w|thanos_components:up:max_avg_over_time4w"}
)
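
As a rough illustration of why the minimum is the safer summary here, the Python sketch below compares it with the mean for three services. The availability figures are hypothetical and the service keys simply reuse the prefixes of the metric names above.

```python
# Hypothetical 4-week availability per service (fractions of successful scrapes).
availability = {
    "kube_prometheus": 1.0,
    "prox_prometheus": 1.0,
    "thanos_components": 0.7,
}

# The original rule: arithmetic mean across the three services.
mean_availability = sum(availability.values()) / len(availability)

# The suggested alternative: the worst service determines the summary.
min_availability = min(availability.values())
worst_service = min(availability, key=availability.get)

print(round(mean_availability, 3), min_availability, worst_service)
```

The mean stays around 0.9 even though one service was available only 70% of the time; the minimum reports that 70% directly and also identifies which service is responsible.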

Written by Aliaksandr Valialkin

Founder and core developer at VictoriaMetrics
