RESOURCE LIBRARY

Everything we've written for SREs.

Blog posts, guides, a glossary, use cases, the product tour, cheat sheets, and the things we wish someone had handed us when we were on call. The library is growing as we publish; if you can't find something, ask.

Browse by format

Blog

543+ blog posts

Long-form essays on agentic SRE, AIOps, observability, incident response, and the operational practices behind reliable platforms.

Weekly cadence · 24 categories

Guides

Step-by-step guides

How to wire Nova into AWS, Slack, GitHub, Grafana, and Docker. How to set up your first alert, your first runbook, your first on-call rotation.

21 guides · 5-15 min each

Glossary

SRE & AIOps glossary

Plain-English definitions for every term you encounter on call: MTTR, SLO, error budget, agentic SRE, alert fatigue, and 130 more.

135 terms · A-Z coverage

Use Cases

Job-to-be-done playbooks

Twelve specific buyer pains mapped to Nova features: cut MTTR, eliminate 3am pages, replace Datadog, pass SOC 2, and more.

12 use cases · JTBD format

Product Tour

60-second click-through tour

Six real product screens walking through detect, triage, diagnose, decide, remediate, outcome. No signup required.

~60 sec · keyboard navigable

Calculator

ROI calculator

Punch in your team size, on-call hours, and tool spend. See the dollar impact of cutting MTTR and consolidating tools.
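As a rough illustration of the arithmetic behind a calculator like this, here is a minimal Python sketch. It is not Nova's actual formula: the field names, the `annual_savings` function, and the assumption that on-call hours shrink in proportion to the MTTR reduction are all hypothetical, chosen only to show how team size, on-call hours, and tool spend can combine into a dollar figure.

```python
from dataclasses import dataclass


@dataclass
class RoiInputs:
    """Hypothetical inputs; names are illustrative, not Nova's schema."""
    engineers: int                      # people in the on-call rotation
    oncall_hours_per_month: float       # incident hours per engineer per month
    hourly_cost: float                  # fully loaded cost per engineer hour
    annual_tool_spend: float            # current yearly spend on tooling
    mttr_reduction: float               # assumed fraction of incident hours saved (0-1)
    tool_consolidation: float           # assumed fraction of tool spend removed (0-1)


def annual_savings(i: RoiInputs) -> float:
    """Estimate yearly savings under a simple linear model (an assumption)."""
    # Labor: incident hours avoided per year, priced at the hourly cost.
    labor = (i.engineers * i.oncall_hours_per_month * 12
             * i.hourly_cost * i.mttr_reduction)
    # Tooling: share of the current spend that consolidation removes.
    tools = i.annual_tool_spend * i.tool_consolidation
    return labor + tools


example = RoiInputs(
    engineers=8,
    oncall_hours_per_month=20,
    hourly_cost=100.0,
    annual_tool_spend=120_000.0,
    mttr_reduction=0.25,
    tool_consolidation=0.5,
)
print(f"Estimated annual savings: ${annual_savings(example):,.0f}")
```

The two reduction fractions are the inputs most worth challenging: they are guesses until measured, so it helps to run the estimate with conservative and optimistic values side by side.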

Interactive · ~2 min

Cheat sheets

Cheat Sheet

Alert severity matrix

Sev-1 to Sev-4 with response time, channels, escalation tier, comms cadence, and exit criteria. One table, no committee meeting.

Cheat Sheet

Error budget formulas

Budget remaining, burn rate, multi-window thresholds, and the "am I running hot?" check, with the exact PromQL for each.
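For readers who want the arithmetic before opening the cheat sheet, the quantities named above can be sketched in Python. This is a minimal sketch using the standard definitions, not the cheat sheet's own PromQL: the function names are hypothetical, and the default threshold of 14.4 is the commonly cited fast-burn value from the Google SRE Workbook (2% of a 30-day budget spent in one hour), assumed here rather than taken from the cheat sheet.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the budget left in the SLO window (negative if overspent)."""
    allowed_failures = error_budget(slo) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo: float, total: int, failed: int) -> float:
    """Observed error rate divided by the budgeted error rate.

    A value of 1.0 spends the budget exactly over the full SLO window.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo)


def running_hot(slo: float,
                long_window: tuple[int, int],
                short_window: tuple[int, int],
                threshold: float = 14.4) -> bool:
    """Multi-window check: alert only if BOTH windows exceed the threshold.

    Each window is a (total_requests, failed_requests) pair. The long window
    shows the burn is significant; the short window shows it is still going.
    """
    return (burn_rate(slo, *long_window) >= threshold
            and burn_rate(slo, *short_window) >= threshold)


slo = 0.999
print(budget_remaining(slo, total=1_000_000, failed=250))
print(burn_rate(slo, total=10_000, failed=200))
print(running_hot(slo, long_window=(10_000, 200), short_window=(1_000, 30)))
```

Requiring both windows to exceed the threshold is what keeps a short, already-recovered spike from paging anyone.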

Cheat Sheet

kubectl debug cheat sheet

Every kubectl command an on-call engineer reaches for under a 3am page (pods, logs, exec, port-forward, events, top, describe) on a single page.

Cheat Sheet

Incident comms templates

Detection, mitigation, resolution, post-mortem: the four customer-facing updates with copy you can paste in at 3am without thinking.

Buyer's guides

Buyer's Guide

Best AIOps platforms 2026

Twelve platforms scored on detection, correlation, automation, post-mortems, and TCO. The clear leaders, and the laggards.

Buyer's Guide

AIOps migration guide

Datadog out, Nova in, or whichever direction you're going. The dual-run pattern, the data-portability checklist, and the cutover script.

Comparisons

75 head-to-head comparisons

Nova vs Datadog, PagerDuty, BigPanda, Splunk, New Relic, Dynatrace, and 69 more, scored side-by-side on the workflows on-call engineers actually run.

Topic guides

AI SRE & Agentic SRE

The two pillar guides to AI-driven reliability: AI SRE and Agentic SRE, the architecture for autonomous operations.

2 pillar guides

Detect & Resolve

AIOps & incident response

AIOps, AI incident response, incident management, root cause analysis, and self-healing infrastructure.

5 guides

Metrics & Practice

SRE metrics & practices

MTTR, SLOs, golden signals, alert fatigue, on-call, runbooks, postmortems, chaos engineering, and toil.

9 guides

Telemetry

Observability & monitoring

Observability, monitoring, distributed tracing, log management, anomaly detection, Kubernetes, microservices, and AI observability.

8 guides

Delivery & Platform

DevOps & platform engineering

DevOps, DevOps automation, platform engineering, CI/CD, infrastructure as code, SRE, capacity planning, and cloud cost optimization.

8 guides

AI Systems

Reliability for AI systems

For teams shipping AI in production: the AI engineer's guide to production reliability and LLMOps.

2 guides

Webinars & talks

We're not scheduling webinars yet. The first ones will land here when we do. In the meantime, the 60-second product tour walks through the core Nova workflow.

eBooks & analyst reports

No long-form eBooks or analyst placements yet. We'll publish a Founder's Guide to Agentic SRE this quarter. For something specific, email product@novaaiops.com.