Metric types
- Counter: a value that can only increase (it resets to zero only when the process restarts)
- Gauge: a single value that can go up or down
- Histogram: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
A histogram with a base metric name of <basename> exposes multiple time series during a scrape:
- cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
- the total sum of all observed values, exposed as <basename>_sum
- the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
- Summary: Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
A summary with a base metric name of <basename> exposes multiple time series during a scrape:
- streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
- the total sum of all observed values, exposed as <basename>_sum
- the count of events that have been observed, exposed as <basename>_count
See histograms and summaries for detailed explanations of φ-quantiles, summary usage, and differences to histograms.
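The cumulative-bucket behavior described above can be sketched in Python. The bucket bounds and observed durations here are made-up assumptions for illustration, not Prometheus defaults:

```python
import math

def observe_all(observations, bounds):
    """Build cumulative bucket counts plus the _sum and _count values."""
    buckets = {le: 0 for le in bounds}
    total = 0.0
    for value in observations:
        total += value
        for le in bounds:
            if value <= le:  # cumulative: every bucket whose bound >= value is incremented
                buckets[le] += 1
    return buckets, total, len(observations)

# Hypothetical request durations in seconds and bucket upper bounds ("le" labels)
durations = [0.05, 0.2, 0.2, 0.7, 3.0]
bounds = [0.1, 0.25, 0.5, 1.0, math.inf]

buckets, total_sum, count = observe_all(durations, bounds)
print(buckets)                     # {0.1: 1, 0.25: 3, 0.5: 3, 1.0: 4, inf: 5}
print(round(total_sum, 2), count)  # 4.15 5
```

Note that _count always equals the +Inf bucket, which is why the two series are identical.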
Histogram Operations
- Average request duration: To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
- Percentage of requests served within 300ms
The following expression calculates it by job for the requests served in the last 5 minutes:
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)
- Apdex score (see below): The target request duration is 300ms, and the tolerable request duration is 1.2s. The following expression yields the Apdex score for each job over the last 5 minutes:
(sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)) / 2
/ sum(rate(http_request_duration_seconds_count[5m])) by (job)
Apdex
wiki: https://en.wikipedia.org/wiki/Apdex
Apdex measures user satisfaction as a value between 0 (no users satisfied) and 1 (all users satisfied) using the formula:
Apdex_t = \frac{SatisfiedCount + \frac{ToleratingCount}{2}}{TotalSamples}
Example: if there are 100 samples with a target time of 3 seconds, where 60 are below 3 seconds, 30 are between 3 and 12 seconds, and the remaining 10 are above 12 seconds, the Apdex score is:
Apdex_3 = (60 + (30/2)) / 100 = 0.75
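The worked example above, checked in a quick Python sketch, both via the direct formula and via the cumulative-bucket form that the PromQL expression uses:

```python
# Target t = 3s, tolerable 4t = 12s, 100 samples (numbers from the example above)
satisfied, tolerating, total = 60, 30, 100

apdex = (satisfied + tolerating / 2) / total
print(apdex)  # 0.75

# The PromQL expression computes the same thing from cumulative buckets:
# bucket{le="3"} = 60, bucket{le="12"} = 60 + 30 = 90
bucket_3s, bucket_12s = 60, 90
print((bucket_3s + bucket_12s) / 2 / total)  # 0.75
```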
Google's SLI
The ratio of successful (non-5xx) requests to total requests over a 7-day window:
sum(rate(http_requests_total{host="api", status!~"5.."}[7d])) / sum(rate(http_requests_total{host="api"}[7d]))
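A sketch of the same ratio with made-up per-second rates; the status codes and values here are assumptions, not real data:

```python
# Hypothetical per-second request rates by status code over the window
rates = {"200": 9.4, "404": 0.5, "503": 0.1}

# status!~"5.." keeps everything that is not a 5xx
good = sum(rate for status, rate in rates.items() if not status.startswith("5"))
total = sum(rates.values())
print(round(good / total, 4))  # 0.99
```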
Functions
rate
rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector (generated with the [time] syntax). It handles counter resets correctly, and it extrapolates to compensate for missed scrapes and for scrapes that don't align perfectly with the window boundaries.
Example:
Let's say we have the following 2-minute range vector that we got using <counter_name>[2m]:
1095 @ 1586350105.912
1095 @ 1586350120.912
1095 @ 1586350135.912
1096 @ 1586350150.912
1096 @ 1586350165.912
1096 @ 1586350180.912
1096 @ 1586350195.912
1096 @ 1586350210.912
Calling rate(<counter_name>[2m]) should return the per-second rate at which <counter_name> increased, which in this case is:
\frac{1096 - 1095}{2 * 60} = \frac{1}{120} = 0.008333333
right? Nah!! Remember, rate does extrapolation in case of imperfect alignment: the samples only cover 1586350210 (last scrape) - 1586350105 (first scrape) = 105 seconds, but we are calculating rate over two minutes, which is 120 seconds, hence the extrapolation:
105 seconds -------------------- 1
120 seconds -------------------- x
That gives an extrapolated increase of x = \frac{120}{105} = 1.142857143 over the full window, so the per-second rate is \frac{1.142857143}{120} = \frac{1}{105} ≈ 0.00952381.
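A simplified sketch of that calculation in Python. It ignores Prometheus's rules that limit extrapolation near large gaps, so it is an approximation of the real implementation, not a faithful copy:

```python
# (timestamp, value) samples from the range vector above
samples = [
    (1586350105.912, 1095), (1586350120.912, 1095), (1586350135.912, 1095),
    (1586350150.912, 1096), (1586350165.912, 1096), (1586350180.912, 1096),
    (1586350195.912, 1096), (1586350210.912, 1096),
]
window = 120.0  # the [2m] range in seconds

increase = samples[-1][1] - samples[0][1]  # 1096 - 1095 = 1
sampled = samples[-1][0] - samples[0][0]   # 105 seconds actually covered

extrapolated_increase = increase * window / sampled  # scale 105s of data up to 120s
per_second_rate = extrapolated_increase / window     # equivalent to increase / sampled
print(round(per_second_rate, 8))  # 0.00952381
```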
Notes:
- Don't use rate() with a gauge; it will treat decreasing values as counter resets
- When combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take the rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
histogram_quantile
Example:
buckets = [{0.1 0} {0.25 5} {0.5 7} {1 10} {+Inf 10}]
q = 0.95
histogram_quantile(q, buckets) = 0.9166666666666667
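A Python sketch of the linear interpolation histogram_quantile performs, reproducing the example above. It is simplified from the real implementation, which handles more edge cases (empty histograms, out-of-range q, and so on):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +Inf."""
    total = buckets[-1][1]
    rank = q * total                        # here: 0.95 * 10 = 9.5
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound           # quantile lands in the +Inf bucket
            # linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 0), (0.25, 5), (0.5, 7), (1.0, 10), (math.inf, 10)]
print(round(histogram_quantile(0.95, buckets), 6))  # 0.916667
```

Here the rank 9.5 falls in the (0.5, 1.0] bucket, which holds observations 8 through 10, so the result is interpolated 2.5/3 of the way between 0.5 and 1.0.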
Why use the 95th percentile?
- TL;DR: it's more realistic to ignore the top 5% of exceptional values
- Resources:
  - https://en.wikipedia.org/wiki/Burstable_billing
  - https://thwack.solarwinds.com/t5/NPM-Discussions/Purpose-of-the-95-percentile-in-the-graph/td-p/275438