Plain-English definitions, written by SREs, for SREs. Each term links to the part of Nova that handles it in production.
An operating model for site reliability where specialized AI agents own the full incident loop (detect, diagnose, decide, remediate, audit), with human policy as the guardrail.
Artificial Intelligence for IT Operations, the use of ML and statistical models to detect anomalies, correlate alerts, and surface root cause across observability data.
The desensitization that happens when on-call engineers receive so many low-quality alerts that they begin missing the real ones.
A security policy that explicitly permits a known set of inputs (IPs, domains, users) and denies everything else by default.
ML-based detection of unusual patterns in metrics, logs, or traces that static thresholds cannot catch.
A single entry point for all client traffic that handles auth, rate limiting, routing, and observability for downstream services.
The closed-loop action where a system detects an issue and resolves it without human intervention, within a policy envelope.
One or more physically separate datacenters inside a cloud region, the unit of fault isolation cloud architectures use for high availability.
The mechanism by which an overloaded downstream signals upstream to slow down, instead of dropping requests silently.
Monitoring a system from the outside, the way a user sees it, without internal instrumentation, complementary to white-box monitoring.
An incident review focused on systems and processes rather than on any individual's mistake, the convention that produces durable lessons.
The set of services, customers, or regions impacted by a single incident, the operational equivalent of measuring the explosion.
The component that limits the throughput of the whole system; identifying it is the first step of any capacity-planning exercise.
How fast a service is consuming its error budget, the leading indicator that SLO-based alerting fires on.
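As a rough sketch of the arithmetic, burn rate is commonly computed as the observed error ratio divided by the error ratio the SLO allows. The function name and the example numbers are illustrative assumptions, not a description of how Nova computes it.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than 'exactly on budget' the budget is burning.

    error_ratio: observed fraction of failed requests in the window (0.0 to 1.0)
    slo_target:  the SLO as a fraction, e.g. 0.999 for 99.9%
    """
    allowed = 1.0 - slo_target  # the error budget expressed as a ratio
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_ratio / allowed


# Hypothetical example: a 99.9% SLO with 1% of requests failing
rate = burn_rate(error_ratio=0.01, slo_target=0.999)  # roughly 10x
```

A burn rate of 1 spends the budget exactly over the SLO window; at roughly 10x, a 30-day budget would be gone in about 3 days.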
A release pattern that routes a small fraction of traffic to a new version first, so you find problems before they hit everyone.
The number of unique label combinations a metric can produce, the cost driver and stability risk for time-series databases.
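To make the cost driver concrete, here is a small sketch of the worst-case arithmetic: the upper bound on series for one metric is the product of the distinct values of each label. The label names and counts are made up for illustration.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric:
    the product of the number of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels on a request-count metric
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "pod": [f"pod-{i}" for i in range(50)],
}
series = max_series(labels)  # 2 * 3 * 50 = 300
```

Adding a single unbounded label such as a user ID multiplies this number by the count of users, which is why such labels are usually kept out of metrics.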
Deliberately injecting failures into production-like systems to discover weaknesses before customers do.
Continuous Integration and Continuous Delivery, the automated pipeline that builds, tests, and ships code to production.
A resilience pattern that stops calling a failing dependency for a cooldown period, so retries don't pile on while the dependency recovers.
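A minimal sketch of the pattern, assuming a breaker that opens after a fixed number of consecutive failures and allows one trial call once the cooldown has elapsed. The class name, thresholds, and injectable clock are illustrative choices, not a specific library's API.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive failures, retries after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Still cooling down: fail fast without touching the dependency
                raise RuntimeError("circuit open: skipping call")
            # Cooldown elapsed: let one trial call through (half-open)
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap the dependency call, for example `breaker.call(lambda: client.get(url))`, and treat the fast `RuntimeError` as a signal to fall back rather than retry.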
When the live state of infrastructure diverges from its declared source-of-truth, the silent precursor to most surprise outages.
A holding queue for messages that failed processing too many times, the safety net that prevents poison messages from blocking a pipeline.
How often a team ships to production, the DORA metric that signals engineering velocity: on demand, often several times a day, for elite teams; monthly or less often for low performers.
The plan, infrastructure, and tested runbook that bring a system back from a region-level loss such as a fire, flood, or regional cloud outage.
End-to-end request-level observability that follows one user action across every microservice it touches.
Four engineering-performance metrics (deploy frequency, lead time, MTTR, and change-failure rate) validated by DORA's State of DevOps research.
Running compute and caching close to users, at CDN points of presence, instead of only at the origin region.
The acceptable amount of unreliability a service can have before reliability work must take priority over feature work.
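As a worked example of the arithmetic, an availability SLO translates directly into minutes of allowed downtime per window. This is a sketch under the assumption of a time-based availability SLO; the function name and the 30-day default are illustrative.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed minutes of unavailability in the window for an availability SLO.

    slo_target: the SLO as a fraction, e.g. 0.999 for 99.9%
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


budget = error_budget_minutes(0.999)  # about 43.2 minutes per 30 days
```

Each extra nine cuts the budget by a factor of ten: 99.99% over the same window leaves about 4.3 minutes.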
An architecture where services communicate by emitting and reacting to events instead of calling each other directly.
A consistency model where reads may see stale data briefly, but all replicas converge given enough time without writes.
A retry strategy that doubles the wait time after each failure, the standard pattern for graceful retry under partial failure.
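A short sketch of the schedule this produces, assuming a cap on the maximum wait (a common addition so delays do not grow without bound). The parameter names and defaults are illustrative.

```python
def backoff_delays(base_s=0.5, factor=2.0, max_s=30.0, attempts=6):
    """Return the wait before each retry: base, base*2, base*4, ... capped at max_s."""
    return [min(max_s, base_s * factor ** n) for n in range(attempts)]


delays = backoff_delays()  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```

In practice the delays are usually randomized as well, so that many clients failing at the same moment do not retry in lockstep.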
The automated switch from a failed primary to a standby replica, a cornerstone of high-availability architecture.
A pattern where one upstream request triggers many downstream calls, the classic source of cascading load surprises.
Deliberately introducing failures (latency, errors, dropped packets) into a system to see how it behaves under stress.
A runtime toggle that enables or disables a code path without redeploying, the safest way to ship risky changes.
A root-cause technique that asks 'why?' iteratively until you reach a systemic cause, not a symptomatic one.
An engineer embedded directly with a customer or partner team, the role that turns a platform into a usable product.
A scheduled exercise where a team injects faults into a system and practices the response, the operational drill that keeps DR muscles alive.
A stop-the-world period in managed runtimes (JVM, Go, Node) where the application freezes while memory is reclaimed, the silent latency tax.
Keeping copies of data in multiple regions so reads are fast worldwide and the system survives the loss of any one region.
An operations model where the desired state of every cluster lives in Git, and a controller continuously reconciles reality to it.
Latency, traffic, errors, and saturation: the four metrics Google's SRE book recommends for monitoring any service.
The pattern of falling back to a reduced experience when a dependency fails, instead of failing the whole request.
A small endpoint or probe a service exposes so a load balancer or orchestrator knows if it's ready to receive traffic.
A periodic 'I'm alive' signal a process emits so other systems can detect when it stops, the inverse of polling.
A design property of a system that aims for very low downtime, typically 99.9%+, by removing single points of failure.
A metric that buckets observations by value, the right primitive for tracking latency distributions and computing percentiles.
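A minimal sketch of the bucketing step, assuming sorted upper bounds and one overflow bucket for anything larger. The bounds and sample latencies are made up, and real metric libraries typically store cumulative counts rather than per-bucket counts.

```python
from bisect import bisect_left


def bucket_counts(values, upper_bounds):
    """Count observations per bucket.

    Bucket i holds values that are <= upper_bounds[i] and greater than the
    previous bound. The final slot is the overflow bucket.
    upper_bounds must be sorted in ascending order.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


# Hypothetical latencies in seconds, bucketed at 100 ms, 500 ms, and 1 s
counts = bucket_counts([0.05, 0.1, 0.3, 0.7, 2.0], [0.1, 0.5, 1.0])
# counts == [2, 1, 1, 1]
```

Because only the counts are stored, memory stays constant no matter how many observations arrive; the trade-off is that percentiles computed from buckets are estimates bounded by the bucket widths.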
Adding more replicas of a service to handle more load, the cloud-native scaling model versus making a single replica bigger.
A property that makes the same operation safe to retry: performing it twice yields the same result as performing it once.
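One common way to get this property is an idempotency key: the client sends a unique key with the request, and the server replays the stored result if the key repeats. The sketch below uses a hypothetical in-memory `ChargeService`; a real service would persist the keys and expire them.

```python
class ChargeService:
    """Hypothetical service that replays the stored result when a key repeats."""

    def __init__(self):
        self.results = {}  # idempotency key -> receipt already returned
        self.ledger = []   # side effects that must not be duplicated

    def charge(self, idempotency_key, account, amount):
        if idempotency_key in self.results:
            # Retry of a request we already handled: no new side effect
            return self.results[idempotency_key]
        self.ledger.append((account, amount))
        receipt = {"charge_id": len(self.ledger), "account": account, "amount": amount}
        self.results[idempotency_key] = receipt
        return receipt


svc = ChargeService()
first = svc.charge("key-1", "acct-42", 25)
retry = svc.charge("key-1", "acct-42", 25)  # same receipt, ledger unchanged
```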
The single person who owns the incident from declaration to resolution, the role that prevents responders from stepping on each other.
The classification (Sev-1, Sev-2, ...) that drives response time, channel, escalation, and comms cadence for an incident.
Defining cloud infrastructure (VPCs, instances, IAM, DNS) in version-controlled files so it's reproducible, reviewable, and reversible.
The Kubernetes resource that defines how external HTTP traffic reaches services inside the cluster, a centralized routing rule.
The number of read or write operations a storage device can handle per second, often the cap on database performance.
An open-source distributed tracing system, a CNCF graduated project that many teams start with.
Random variance added to a fixed interval (retries, polling, scheduled jobs) to prevent thundering-herd synchronization.
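A small sketch of one way to do this, assuming a symmetric spread around the fixed interval. The function name and the 20% default are illustrative; other schemes, such as picking uniformly between zero and the interval, are also common.

```python
import random


def jittered(interval_s, spread=0.2, rng=random):
    """Return interval_s randomized by up to +/- spread (0.2 means +/- 20%).

    rng is anything with a uniform(a, b) method; pass random.Random(seed)
    for reproducible tests.
    """
    return interval_s * (1.0 + rng.uniform(-spread, spread))


# A 60-second poll becomes something between 48 and 72 seconds
next_poll_s = jittered(60.0)
```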
A scheduled or one-shot task that runs to completion, the workhorse of background processing and periodic maintenance.
A signed, base64url-encoded token format that encodes claims about a user, the standard primitive for stateless authentication.
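To show the structure (three base64url segments: header, payload, signature), here is a standard-library sketch that signs and verifies an HS256 token. The helper names are illustrative, and the sketch deliberately skips claim checks such as expiry; production code should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def b64url(data):
    """base64url-encode bytes without '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Decode a base64url string, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims, secret):
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify_hs256(token, secret):
    header_b64, payload_b64, sig_b64 = token.split(".")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many bytes matched
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))


token = sign_hs256({"sub": "user-123"}, b"demo-secret")
claims = verify_hs256(token, b"demo-secret")
```

The payload is only encoded, not encrypted: anyone holding the token can read the claims, so the signature protects integrity, not confidentiality.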
A distributed event-streaming platform, the dominant open-source backbone for high-throughput async messaging and CDC.
A quantitative measure that tracks progress against a strategic goal, the executive-facing metric in any operations dashboard.
The dominant open-source container orchestrator, the substrate most modern SRE work runs on.
A Kubernetes-native config-overlay tool, the way most GitOps teams template environment differences without templating engines.
The time between a request starting and a response arriving, the user-perceived 'is it fast' metric.
The protocol distributed systems use to pick a single coordinator, a foundational primitive behind many consensus and HA systems.
The component that distributes incoming traffic across multiple backend replicas, the entry point for any horizontally-scaled service.
Pulling logs from every service into a single searchable system, the foundation of investigation-time observability.
An agent (Fluent Bit, Vector, Filebeat) that collects logs from each host and forwards them to a central aggregator.
A pre-announced period when degraded service is acceptable, the dying art that zero-downtime deploys are replacing.
An architecture style where the system is a fleet of small, independently-deployable services communicating over the network.
The average time a service stays up between failures, an inverse measure of how often it breaks.
The average time between an incident starting and the team being aware of it, the first half of the incident response loop.
The average time from an incident starting to the service being restored to normal, the headline reliability metric for SRE teams.
Running a service across multiple cloud regions for reliability and latency, the next tier above multi-AZ.
A failure mode where part of the network can't reach another part, the worst-case scenario for distributed-systems correctness.
A machine in a cluster, physical or virtual, the unit of capacity Kubernetes and similar orchestrators schedule onto.
The set of techniques (deduplication, correlation, inhibition, severity tiering) that turn thousands of raw alerts into a manageable signal.
When one tenant's heavy usage degrades performance for everyone else on the same shared resource, the cardinal multi-tenancy risk.
The ability to ask new questions about a running system from the outside, without shipping new code, by leaning on logs, metrics, and traces.
The schedule that decides which engineer is responsible for incidents at any given moment, the operational backbone of every SRE team.
The open-source vendor-neutral standard for emitting traces, metrics, and logs, the way most modern observability tools take input.
Coordinating the lifecycle of many components (containers, jobs, workflows) so the system runs as one, the job orchestrators do.
The authoritative server that holds the real data, behind any layer of CDN, edge, or cache.
A period during which a service fails to meet its availability SLO, the event that triggers incident response.
The 99th percentile of response times: the value that 99% of requests are at or below, so roughly 1% are slower. It is the SLI that captures tail-of-distribution pain.
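A minimal sketch of computing it from raw samples, assuming the nearest-rank definition (monitoring systems often interpolate or estimate from histogram buckets instead). The function name is illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies of 1..100 ms: p99 is 99 ms, and one request is slower
p99 = percentile(list(range(1, 101)), 99)
```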
The escalation tool (PagerDuty, OpsGenie, Incident.io) that wakes the on-call engineer when something needs immediate attention.
The smallest deployable unit in Kubernetes, one or more containers sharing a network namespace and lifecycle.
The structured writeup produced after every incident rated Sev-1 or more severe that captures timeline, root cause, impact, and action items.
The Kubernetes mechanism that asks each pod two questions: are you alive, and are you ready for traffic.
Sampling stack traces from running processes to find what's actually consuming CPU, memory, or I/O time.
The throughput rate of requests against a service, the headline traffic metric in capacity planning and rate-limit design.
The number of items waiting in a queue, a leading indicator of saturation that fires before latency does.
The minimum number of nodes that must agree for a distributed system to make progress, the basis of consensus protocols.
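For the common case of a simple majority quorum, the arithmetic is short enough to sketch. The function names are illustrative, and systems with weighted or flexible quorums use different rules.

```python
def quorum_size(n_nodes):
    """Smallest strict majority of n_nodes."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes):
    """How many nodes can be lost while a majority can still form."""
    return n_nodes - quorum_size(n_nodes)


# 3 nodes -> quorum 2, tolerates 1 failure
# 4 nodes -> quorum 3, still tolerates only 1
# 5 nodes -> quorum 3, tolerates 2
```

This is why clusters are usually sized with an odd number of nodes: going from 3 to 4 adds cost without adding fault tolerance.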
A hard cap on resource consumption per tenant, the multi-tenancy primitive that prevents one user from consuming the whole platform.
Capping the number of requests a client can make per time window, the universal defense against abuse and accidental load.
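A minimal sketch of the simplest variant, a fixed-window counter per client. The class name and injectable clock are illustrative; token-bucket and sliding-window limiters smooth out the burst this approach allows at window boundaries.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each window of window_s seconds."""

    def __init__(self, limit, window_s, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counts = defaultdict(int)  # (client, window index) -> requests seen

    def allow(self, client):
        window = int(self.clock() // self.window_s)
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False  # over the cap: reject (or queue) the request
        self.counts[key] += 1
        return True
```

This sketch never evicts old windows, so a long-running process would need to prune `counts` or keep the counters in a store with expiry.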
A copy of a service or data, the building block of horizontal scaling, high availability, and disaster recovery.
A cascading failure where every client retries simultaneously, amplifying load on a downstream that's already struggling.
Reverting a system to a known-good prior state, the fastest path to recovery (and so to a lower MTTR) when a deploy caused the incident.
The structured process of finding the systemic cause behind an incident, not just the surface symptom.
A written procedure that tells an on-call engineer (or an AI agent) exactly how to respond to a specific kind of incident.
A live, automatically-built diagram of which services call which, with health overlaid, the topological view of an incident's blast radius.
Partitioning data across multiple database instances, the way to scale writes past what a single instance can handle.
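A short sketch of hash-based shard selection, one of several ways to decide which instance owns a key (range-based and directory-based schemes are alternatives). The function name is illustrative.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


shard = shard_for("customer-1234", 8)  # always the same index for this key
```

Plain modulo placement moves most keys when `n_shards` changes, which is why systems that reshard often use consistent hashing or a fixed number of virtual shards instead.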
The customer-facing contract that promises a level of service, typically with financial credits for breaches, the legal twin of the SLO.
A measurable quantity that describes one specific aspect of a service's behavior, the input to an SLO.
The internal target for an SLI, e.g. '99.9% of requests under 200ms over 30 days', the contract a team holds itself to.
A component whose failure takes down the whole system, the architectural smell HA design exists to eliminate.
Scripted requests run from outside your infrastructure on a schedule, the way you measure 'what does a user actually experience'.
The raw data (logs, metrics, traces, events) that a system emits about itself, the input layer for all of observability.
Slowing or rejecting requests once a client exceeds a rate or quota, the enforcement mechanism behind every rate limit.
The manual, repetitive, automatable work that scales linearly with service growth, the operational debt SRE was invented to eliminate.
The shape of a system: which services exist, how they connect, and where they run. It is the substrate every other observability technique builds on.
How long a piece of data stays valid before being refreshed or expired, the central cache and DNS lever.
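A minimal sketch of how a cache applies this: each entry records its expiry time, and a read past that time is treated as a miss. The class name and injectable clock are illustrative.

```python
import time


class TTLCache:
    """Tiny in-memory cache where every entry expires ttl_s seconds after it is set."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl_s)

    def get(self, key, default=None):
        item = self.entries.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]  # expired: the caller must refetch from the source
            return default
        return value
```

The trade-off is the usual one: a longer TTL means fewer trips to the source but a longer window in which readers can see stale data.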
Free-form text log lines without machine-readable fields, the legacy format that blocks most modern observability techniques.
The service that calls another, that is, the one whose requests reach the downstream. The perspective shifts at each layer: a service is upstream of what it calls and downstream of what calls it.
The fraction of time a service is in a working state, the simplest reliability number a customer ever asks about.
The sequence of steps a user takes to accomplish a goal, the right granularity for synthetic monitoring and SLO measurement.
How an API evolves over time without breaking existing clients, the discipline that decides who pays the upgrade cost.
Storage that outlives the pod or container that uses it, the bridge between ephemeral compute and persistent state.
A private network connection between two cloud VPCs that lets services communicate without traversing the public internet.
Automated checks that flag known security issues in code, dependencies, container images, or running infrastructure.
The dedicated channel, video bridge, or doc that the responders to a Sev-1 work in together for the duration of the incident.
A user-defined HTTP callback fired by one system to notify another that an event happened, the universal integration glue.
Coordinating multi-step business processes (with retries, branching, time-based steps) so they run reliably end-to-end.
A logical unit of compute that does work, in Kubernetes specifically: Deployments, StatefulSets, DaemonSets, Jobs, CronJobs.
An HTTP header that records the chain of client IPs through proxies, the canonical way to recover the real client IP behind a load balancer.
The cryptographic certificate format underpinning TLS/SSL, the document that proves a server is who it says it is.
The catch-all for cloud-delivered services: SaaS, PaaS, IaaS, FaaS, DBaaS, the consumption model that defines the cloud era.
A unified security platform that correlates signals across endpoints, networks, identity, and cloud, the security equivalent of agentic SRE.
The engineering principle that says don't build for hypothetical future needs, the antidote to over-architected systems.
The whitespace-sensitive serialization format that became the lingua franca of Kubernetes manifests, CI configs, and infrastructure-as-code.
A pinned-dependency manifest (yarn.lock, package-lock.json) that guarantees every install reproduces the same dependency tree.
A hardware security key for phishing-resistant MFA, the gold standard for authenticating SREs into production systems.
A security architecture that assumes no actor (user, service, network) is trusted by default; every request is authenticated and authorized.
A vulnerability that's being exploited in the wild before a patch is available, the worst-case shape of a security incident.
A deploy strategy that releases new code without any window of unavailability, the table-stakes for any production-facing service.
A child process that finished executing but whose exit status hasn't been reaped by its parent, a Linux process-table leak.
A distributed coordination service used for leader election, config management, and locks, the workhorse behind HBase, Kafka deployments that predate KRaft, and many others.