Metric types
- Counter: a value that can only increase (it resets to zero only when the process restarts)
- Gauge: a single value that can go up or down
- Histogram: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
A histogram with a base metric name of <basename> exposes multiple time series during a scrape:
- cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
- the total sum of all observed values, exposed as <basename>_sum
- the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
- Summary: Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
A summary with a base metric name of <basename> exposes multiple time series during a scrape:
- streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
- the total sum of all observed values, exposed as <basename>_sum
- the count of events that have been observed, exposed as <basename>_count
See histograms and summaries for detailed explanations of φ-quantiles, summary usage, and differences to histograms.
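The cumulative-bucket behavior described above can be sketched in Python. The bucket bounds and observed durations here are made-up assumptions for illustration, not Prometheus defaults:

```python
import math

def observe_all(observations, bounds):
    """Build cumulative bucket counts plus the _sum and _count values."""
    buckets = {le: 0 for le in bounds}
    total = 0.0
    for value in observations:
        total += value
        for le in bounds:
            if value <= le:  # cumulative: every bucket whose bound >= value is incremented
                buckets[le] += 1
    return buckets, total, len(observations)

# Hypothetical request durations in seconds and bucket upper bounds ("le" labels)
durations = [0.05, 0.2, 0.2, 0.7, 3.0]
bounds = [0.1, 0.25, 0.5, 1.0, math.inf]

buckets, total_sum, count = observe_all(durations, bounds)
print(buckets)                     # {0.1: 1, 0.25: 3, 0.5: 3, 1.0: 4, inf: 5}
print(round(total_sum, 2), count)  # 4.15 5
```

Note that _count always equals the +Inf bucket, which is why the two series are identical.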
Histogram Operations
- Average request duration: To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
- Percentage of requests served within 300ms
The following expression calculates it by job for the requests served in the last 5 minutes:
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)
- Apdex score (see below): The target request duration is 300ms, and the tolerable request duration is 1.2s. The following expression yields the Apdex score for each job over the last 5 minutes:
(sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)) / 2
/ sum(rate(http_request_duration_seconds_count[5m])) by (job)
Apdex
wiki: https://en.wikipedia.org/wiki/Apdex
Apdex measures user satisfaction as a value between 0 (no users satisfied) and 1 (all users satisfied) using the formula:
Apdex_t = \frac{SatisfiedCount + \frac{ToleratingCount}{2}}{TotalSamples}
Example: if there are 100 samples with a target time of 3 seconds, where 60 are below 3 seconds, 30 are between 3 and 12 seconds, and the remaining 10 are above 12 seconds, the Apdex score is:
Apdex_3 = (60 + (30/2)) / 100 = 0.75
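The worked example above, checked in a quick Python sketch, both via the direct formula and via the cumulative-bucket form that the PromQL expression uses:

```python
# Target t = 3s, tolerable 4t = 12s, 100 samples (numbers from the example above)
satisfied, tolerating, total = 60, 30, 100

apdex = (satisfied + tolerating / 2) / total
print(apdex)  # 0.75

# The PromQL expression computes the same thing from cumulative buckets:
# bucket{le="3"} = 60, bucket{le="12"} = 60 + 30 = 90
bucket_3s, bucket_12s = 60, 90
print((bucket_3s + bucket_12s) / 2 / total)  # 0.75
```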
Google's SLI
The ratio of successful (non-5xx) requests to total requests over a 7-day window:
sum(rate(http_requests_total{host="api", status!~"5.."}[7d])) / sum(rate(http_requests_total{host="api"}[7d]))
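A sketch of the same ratio with made-up per-second rates; the status codes and values here are assumptions, not real data:

```python
# Hypothetical per-second request rates by status code over the window
rates = {"200": 9.4, "404": 0.5, "503": 0.1}

# status!~"5.." keeps everything that is not a 5xx
good = sum(rate for status, rate in rates.items() if not status.startswith("5"))
total = sum(rates.values())
print(round(good / total, 4))  # 0.99
```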
Functions
rate
rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector (generated with the [time] syntax). It handles counter resets correctly, and it extrapolates to compensate for missed scrapes and for scrapes that don't align perfectly with the window boundaries.
Example:
Let's say we have the following 2-minute range vector that we got using <counter_name>[2m]:
1095 @ 1586350105.912
1095 @ 1586350120.912
1095 @ 1586350135.912
1096 @ 1586350150.912
1096 @ 1586350165.912
1096 @ 1586350180.912
1096 @ 1586350195.912
1096 @ 1586350210.912
Calling rate(<counter_name>[2m]) should return the per-second rate at which <counter_name> increased, which in this case is:
\frac{1096 - 1095}{2 * 60} = \frac{1}{120} = 0.008333333
right? Nah!! Remember, rate does extrapolation in case of imperfect alignment: the samples only cover 1586350210 (last scrape) - 1586350105 (first scrape) = 105 seconds, but we are calculating rate over two minutes, which is 120 seconds, hence the extrapolation:
105 seconds -------------------- 1
120 seconds -------------------- x
That gives an extrapolated increase of x = \frac{120}{105} = 1.142857143 over the full window, so the per-second rate is \frac{1.142857143}{120} = \frac{1}{105} ≈ 0.00952381.
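A simplified sketch of that calculation in Python. It ignores Prometheus's rules that limit extrapolation near large gaps, so it is an approximation of the real implementation, not a faithful copy:

```python
# (timestamp, value) samples from the range vector above
samples = [
    (1586350105.912, 1095), (1586350120.912, 1095), (1586350135.912, 1095),
    (1586350150.912, 1096), (1586350165.912, 1096), (1586350180.912, 1096),
    (1586350195.912, 1096), (1586350210.912, 1096),
]
window = 120.0  # the [2m] range in seconds

increase = samples[-1][1] - samples[0][1]  # 1096 - 1095 = 1
sampled = samples[-1][0] - samples[0][0]   # 105 seconds actually covered

extrapolated_increase = increase * window / sampled  # scale 105s of data up to 120s
per_second_rate = extrapolated_increase / window     # equivalent to increase / sampled
print(round(per_second_rate, 8))  # 0.00952381
```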
Notes:
- Don't use rate() with a gauge; it will treat decreasing values as counter resets
- When combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take the rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
histogram_quantile
Example:
buckets = [{0.1 0} {0.25 5} {0.5 7} {1 10} {+Inf 10}]
q = 0.95
histogram_quantile(q, buckets) = 0.9166666666666667
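A Python sketch of the linear interpolation histogram_quantile performs, reproducing the example above. It is simplified from the real implementation, which handles more edge cases (empty histograms, out-of-range q, and so on):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +Inf."""
    total = buckets[-1][1]
    rank = q * total                        # here: 0.95 * 10 = 9.5
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound           # quantile lands in the +Inf bucket
            # linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 0), (0.25, 5), (0.5, 7), (1.0, 10), (math.inf, 10)]
print(round(histogram_quantile(0.95, buckets), 6))  # 0.916667
```

Here the rank 9.5 falls in the (0.5, 1.0] bucket, which holds observations 8 through 10, so the result is interpolated 2.5/3 of the way between 0.5 and 1.0.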
Why use the 95th percentile?
- TL;DR: it's more realistic to ignore the top 5% of exceptional values
- Resources:
  - https://en.wikipedia.org/wiki/Burstable_billing
  - https://thwack.solarwinds.com/t5/NPM-Discussions/Purpose-of-the-95-percentile-in-the-graph/td-p/275438