Monitors
A monitor is a [[monitors]] entry with an id, a name, a kind and a
probing cadence (interval_secs, timeout_secs). Failed probes are retried
once (after 1s) before anything is recorded - override with
probe_retries = 0..5 - so a one-off network blip never pollutes the history
or the error budget. Every failure that survives its retry is logged with its
reason and shown as a tooltip on the status dot (plus last_error in the
API).
http (default)
Section titled “http (default)”target is a URL; up = 2xx, or exactly expected_status if set. TLS
certificate expiry is checked automatically for https:// targets.
[[monitors]]id = "api"name = "API"target = "https://api.example.com/health"interval_secs = 30timeout_secs = 10# expected_status = 200# degraded_over_ms = 800 # up but slower than this = degraded (yellow)# headers = { Authorization = "Bearer ${TOKEN}" }# proxy = "socks5://127.0.0.1:9050"Body assertions turn a reachability check into a correctness check:
keyword = "operational" # body must contain this (keyword_invert = true → must NOT)json_query = "$.status" # JSONPath (RFC 9535) against a JSON bodyjson_expected = "ok" # the queried value must equal this (omit = must match a node)max_body_kb = 256 # cap on the body read for assertions (default 1 MiB)Redirects are followed (up to 10), but configured headers - which may carry credentials - are only re-attached while the redirect stays on the original origin, so a compromised target can’t bounce your API key to another host.
target is host:port; up = the TCP connect succeeds.
[[monitors]]id = "database"name = "Database"kind = "tcp"target = "db.example.com:5432"interval_secs = 60timeout_secs = 5target is a host or IP (no port); up = an echo reply within the timeout.
Uses an unprivileged datagram socket - no CAP_NET_RAW, rootless-Docker
friendly, IPv4 and IPv6. If the socket is unavailable the monitor reports
down with a clear reason naming the sysctl to fix
(net.ipv4.ping_group_range).
target is a hostname to resolve. Without dns_expected any non-empty
answer counts as up - CDN and round-robin answers rotate constantly, so
alerting on mere change would flap. With it, the answer is pinned
(hijack detection), compared order-insensitively:
[[monitors]]id = "dns-check"name = "DNS Check"kind = "dns"target = "example.com"interval_secs = 300dns_record = "A" # A (default), AAAA, CNAME, MX, NS, TXT, SRV, SOA, PTRdns_expected = "1.2.3.4" # comma-separated, order-insensitivedns_resolver = "8.8.8.8:53" # optional; default: system resolverRun any external check following the monitoring-plugins convention: exit
0 = up, 1 = degraded, anything else = down; the first stdout line is the
message (the |perfdata tail is stripped). The whole Nagios/Icinga plugin
ecosystem - thousands of checks: RAID, disks, exotic certificates, SNMP -
becomes usable from Hora.
[[monitors]]id = "raid"name = "RAID"kind = "exec"command = ["check_raid", "--no-sudo"] # a raw argv - no shell, everinterval_secs = 300timeout_secs = 30 # the plugin is SIGKILLed past thisThe security model is the HORA_EXEC_DIR environment variable -
deliberately not a config key. The config is hot-reloaded, so it alone
must never be able to run code: enabling exec probes takes deployment-level
access on top of config access. With the variable set:
command[0]is resolved strictly inside that directory - a bare file name, validated at load, canonicalized at run time (a symlink planted in the directory that points outside is refused);- no shell is involved:
commandis an argv array, so nothing is ever interpolated or injected; - the plugin gets a scrubbed environment (
PATH,HOME,LANG,TZ): the daemon’s own environment carries your notification tokens, and no plugin has any business reading them; - output is bounded and drained (a chatty plugin never deadlocks on a full pipe), and a stuck plugin is killed at the monitor’s timeout.
Without HORA_EXEC_DIR, a config declaring an exec monitor fails at load -
and hora doctor verifies the directory and that every configured plugin is
actually present.
Exec monitors take no target, are excluded from
multi-vantage confirmation (a local
check has no remote vantage), and their failure detail collapses to
“plugin check failed” for anonymous viewers unless public_error_detail
opts in - plugin output names devices and container ids.
push (heartbeat)
Section titled “push (heartbeat)”No target - the job calls Hora. Down when no ping arrives within
interval_secs:
curl -fsS -X POST -H "X-Push-Token: ${BACKUP_TOKEN}" \ "https://status.example.com/api/push/nightly-backup"The cron-aware variant declares when the job runs, and alerts only when a scheduled run misses its grace window - a 03:00 backup pinging at 03:05 is fine, one still silent at 03:30 + grace is down:
[[monitors]]id = "nightly-backup"name = "Nightly backup"kind = "push"interval_secs = 60 # with `schedule`, just the re-evaluation cadencepush_token = "${BACKUP_TOKEN}"schedule = "0 3 * * *" # five-field cron, UTCgrace_secs = 1800 # how late a ping may be (default 30m)Optional query parameters: ?status=up|down|degraded, msg=... (recorded
with the heartbeat), ping=<ms> (round-trip latency).
Dual-stack verification
Section titled “Dual-stack verification”dual_stack = trueprobes IPv4 and IPv6 separately (concurrently) and requires both: the
classic silent failure is a service whose IPv6 has been dead for weeks behind
a healthy IPv4 (or the reverse), invisible to every single-connection check.
One broken family confirms down with the culprit named - “IPv6 failing:
connection timed out (IPv4 ok)” - and the surviving family’s latency
recorded; when both answer, the recorded latency is the slower path’s, so
degraded_over_ms judges the worst case.
Works for http, tcp and icmp monitors with a hostname target (an IP
literal has a single family); cannot be combined with proxy.
Topology: groups and dependencies
Section titled “Topology: groups and dependencies”group = "infra" # display group on the status pagedepends_on = ["db", "cache"] # upstream monitors this one depends onWhen a monitor goes down its alert is annotated with root cause vs. symptom: “caused by X” when an upstream it depends on is also down, or “impacts: A, B, C” (the blast radius) when its upstreams are all healthy and it is the root cause. The dependency graph is validated acyclic at load. Dependencies also drive root-cause alert grouping.
TLS certificates
Section titled “TLS certificates”For https:// monitors, expiry is checked automatically and warned
alerts.cert_expiry_days in advance (default 14). Optionally pin the
public key:
cert_pin = "abc123..." # SHA-256 of the leaf public keyAn unexpected key change - MITM, botched renewal - alerts once per change,
with the old and new fingerprints. Disable expiry checking per monitor with
check_cert = false.
Domain expiry (RDAP)
Section titled “Domain expiry (RDAP)”The natural sibling of the TLS warnings: “your domain expires in 14 days”.
domain_expiry = "example.com"The registered domain (not the subdomain the target uses - that’s why it
is explicit rather than derived) is checked once a day against the registry
over RDAP - JSON over HTTP, no whois parsing - and an alert fires
alerts.domain_expiry_days (default 14) before it expires, with the same
edge-triggered, maintenance-muted policy as certificates.
Visibility
Section titled “Visibility”public = false # hide from unauthenticated viewerspublic_error_detail = true # publish raw failure reasons to anonymous viewersA monitor with public = false disappears from the unauthenticated status
page, API, badges and history (its latency endpoint answers 404, exactly like
a missing monitor); the viewer token (server.auth_token) reveals the full
view. By default anonymous viewers see only a safe category of a failure
(“HTTP 500”, “content check failed”) - the stored detail can carry response
snippets, DNS answers or asserted keywords; public_error_detail = true opts
a monitor into publishing the full reason (and the
failure snapshot).
Latency and SLO fields
Section titled “Latency and SLO fields”degraded_over_ms = 800 # up but slower = degradedslo_latency_ms = 500 # the 24h p95 is flagged met/breached against thisslo_uptime = 99.9 # availability SLO; shows error budget, arms burn-rate alertsslo_window_days = 30 # SLO window (default 30)retention_days = 30 # how long this monitor's raw history is keptSee SLOs & error budgets for how the budget and burn-rate alerts work.