Blog posts, guides, glossary, use cases, the product tour, cheat sheets, and the things we wish someone had handed us when we were on-call. The library is growing as we publish; if you can't find something, ask.
Long-form essays on agentic SRE, AIOps, observability, incident response, and the operational practices behind reliable platforms.
Guides: How to wire Nova into AWS, Slack, GitHub, Grafana, Docker. How to set up your first alert, your first runbook, your first on-call rotation.
Glossary: Plain-English definitions for every term you encounter on-call. MTTR, SLO, error budget, agentic SRE, alert fatigue, and 130 more.
Use Cases: Twelve specific buyer pains mapped to Nova features: cut MTTR, eliminate 3am pages, replace Datadog, pass SOC 2, and more.
Product Tour: Six real product screens walking through detect, triage, diagnose, decide, remediate, outcome. No signup required.
Calculator: Punch in your team size, on-call hours, and tool spend. See the dollar impact of cutting MTTR and consolidating tools.
Sev-1 to Sev-4 with response time, channels, escalation tier, comms cadence, and exit criteria. One table, no committee meeting.
Cheat Sheet: Budget remaining, burn rate, multi-window thresholds, and the "am I running hot?" check, with the exact PromQL for each.
Cheat Sheet: Every kubectl command an on-call engineer reaches for under a 3am page (pods, logs, exec, port-forward, events, top, describe) on a single page.
Cheat Sheet: Detection, mitigation, resolution, post-mortem: the four customer-facing updates with copy you can paste in at 3am without thinking.
Twelve platforms scored on detection, correlation, automation, post-mortems, and TCO. The clear leaders, and the laggards.
Buyer's Guide: Datadog out, Nova in, or whichever direction you're going. The dual-run pattern, the data-portability checklist, and the cutover script.
Comparisons: Nova vs Datadog, PagerDuty, BigPanda, Splunk, New Relic, Dynatrace, and 70 more, scored side-by-side on the workflows on-call engineers actually run.
The two pillar guides to AI-driven reliability: AI SRE and Agentic SRE, the architecture for autonomous operations.
AIOps, AI incident response, incident management, root cause analysis, and self-healing infrastructure.
MTTR, SLOs, golden signals, alert fatigue, on-call, runbooks, postmortems, chaos engineering, and toil.
Observability, monitoring, distributed tracing, log management, anomaly detection, Kubernetes, microservices, and AI observability.
DevOps, DevOps automation, platform engineering, CI/CD, infrastructure as code, SRE, capacity planning, and cloud cost optimization.
For teams shipping AI in production: the AI engineer's guide to production reliability and LLMOps.