Interview Prep · 26 questions

DevOps interview questions & answers

Real questions you'll get at junior and mid-level DevOps / SRE interviews, with concise senior-level answers. Grouped by topic so you can study one area at a time.

Linux & Networking

4

Explain what happens when you type a URL into your browser and press Enter.

DNS resolves the hostname to an IP, the browser opens a TCP connection to it (port 443 for HTTPS, 80 for HTTP) and, for HTTPS, completes a TLS handshake on top of that connection. It then sends an HTTP request, the server responds with HTML, and the browser parses it, fetching CSS/JS/images in parallel, building the DOM and CSSOM, then rendering. Caches (browser, DNS, CDN) short-circuit many of these steps.
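
The network half of that answer can be traced by hand. The following is a minimal sketch using only the Python standard library; the function names `connection_target` and `fetch_status_line` are made up for illustration, and a real browser does far more (caching, HTTP/2, redirects, rendering).

```python
import socket
import ssl
from urllib.parse import urlsplit


def connection_target(url):
    """Work out scheme, host, port and request target from a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme or "https"
    port = parts.port or (443 if scheme == "https" else 80)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return scheme, parts.hostname, port, target


def fetch_status_line(url, timeout=5.0):
    """Walk DNS -> TCP -> TLS -> HTTP and return the response status line."""
    scheme, host, port, target = connection_target(url)

    # 1. DNS: resolve the hostname to one or more socket addresses.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]

    # 2. TCP: open a connection to the first resolved address.
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    sock.connect(sockaddr)

    # 3. TLS: for HTTPS, the handshake runs on top of the TCP connection.
    if scheme == "https":
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)

    # 4. HTTP: send a request and read the start of the response.
    request = f"GET {target} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    with sock:
        sock.sendall(request.encode("ascii"))
        response = sock.recv(4096)
    return response.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Calling `fetch_status_line("https://example.com/")` needs network access and returns whatever status line the server sends back.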

What's the difference between a process and a thread on Linux?

A process has its own memory space and file descriptors; a thread shares them with sibling threads inside the same process. Linux actually treats them similarly under the hood via clone(2); the difference is which flags are set (CLONE_VM, CLONE_FILES, etc.).
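
The memory-sharing difference is easy to observe. This is a small Linux/Unix-only sketch (it relies on `os.fork`); the function names are illustrative. A thread appends to a list its parent can see, while a forked child only changes its own copy of the address space.

```python
import os
import threading


def demo_fork():
    """A forked child gets its own copy of memory; the parent's list is untouched."""
    data = []
    pid = os.fork()
    if pid == 0:
        # Child process: this append only affects the child's copy.
        data.append("from child")
        os._exit(len(data))  # report the child's list length as its exit code
    _, status = os.waitpid(pid, 0)
    return data, os.WEXITSTATUS(status)


def demo_thread():
    """A thread shares the process's memory; its append is visible to the caller."""
    shared = []
    worker = threading.Thread(target=shared.append, args=("from thread",))
    worker.start()
    worker.join()
    return shared
```

`demo_fork()` returns an empty list plus the child's exit code of 1, and `demo_thread()` returns the list containing the thread's item.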

How do you find which process is listening on port 8080?

`ss -ltnp 'sport = :8080'` or `lsof -iTCP:8080 -sTCP:LISTEN -n -P`. Run them with sudo to see processes owned by other users. Older systems used `netstat -tulpn`.

What does a load balancer do that DNS round-robin can't?

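Health checks, weighted routing, sticky sessions, TLS termination, layer-7 routing (paths/headers), and real-time failover. DNS round-robin only rotates the order of the returned IPs, and resolvers and clients cache them for the TTL. It has no awareness of backend health, so a dead backend keeps receiving traffic until the record is changed and the caches expire.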
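
A toy simulation makes the health-awareness gap concrete. This sketch is not how any real resolver or load balancer is implemented; the class name, the addresses, and the `is_healthy` callback are assumptions for illustration.

```python
from itertools import cycle


def dns_round_robin(backends):
    """Rotate through every address, healthy or not."""
    return cycle(backends)


class HealthCheckedBalancer:
    """Round-robin that skips backends failing a health check."""

    def __init__(self, backends, is_healthy):
        self.backends = list(backends)
        self.is_healthy = is_healthy  # callable: backend -> bool
        self._next = 0

    def pick(self):
        # Try each backend at most once per pick, starting after the last one used.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if self.is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backends")
```

With three backends and the second one down, the round-robin iterator still hands out the dead address on every third request, while `HealthCheckedBalancer.pick()` alternates between the two healthy ones.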

Docker & Containers

4

What's the difference between a Docker image and a container?

An image is an immutable, layered filesystem snapshot plus metadata (CMD, ENV, etc.). A container is a running instance of that image with a thin writable layer on top and its own namespaces/cgroups.

Why use a multi-stage Dockerfile?

To keep build tools (compilers, dev dependencies) out of the final image. You build in a heavy stage and COPY only the artifacts into a slim runtime stage. Result: smaller, more secure images.

What happens to data written inside a container when it stops?

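Stopping alone doesn't lose it: the writable layer survives `docker stop` and `docker start`. That layer is destroyed when the container is removed (`docker rm`, or on exit if it was started with `--rm`), so anything that must outlive the container belongs in a volume or bind mount.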

How does Docker isolate containers from the host?

Linux namespaces (PID, NET, MNT, UTS, IPC, USER) isolate what the container can see, and cgroups limit what it can use (CPU, memory, I/O). Seccomp, AppArmor/SELinux, and capabilities further restrict what it can do.

Kubernetes

4

Explain the difference between a Deployment, a StatefulSet, and a DaemonSet.

Deployment manages stateless replicas with rolling updates. StatefulSet gives each pod a stable identity and persistent storage (databases, queues). DaemonSet runs one pod per node (log shippers, node exporters).

What's the difference between a Service of type ClusterIP, NodePort, and LoadBalancer?

ClusterIP exposes the service inside the cluster only. NodePort opens a static port on every node. LoadBalancer asks the cloud provider for an external LB pointing at NodePorts. In modern setups, Ingress + ClusterIP is the usual pattern.

How do liveness and readiness probes differ?

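A failing liveness probe makes the kubelet restart the container. A failing readiness probe removes the pod from Service endpoints when it can't serve traffic (but doesn't restart it). Use readiness for warm-up or temporary dependency loss; for slow-starting apps, a startup probe holds off the other two until the app is up.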
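
On the application side, the two probes are usually just two HTTP endpoints that answer different questions. Here is a minimal sketch with the standard library; the `/healthz` and `/readyz` paths are a common convention rather than anything Kubernetes requires, and `ProbeState`, `probe_status`, and `start_probe_server` are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ProbeState:
    """Flags the application flips as its condition changes."""

    def __init__(self):
        self.alive = True   # False means "restart me"
        self.ready = False  # False means "don't send me traffic yet"


def probe_status(state, path):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":  # liveness
        return 200 if state.alive else 503
    if path == "/readyz":   # readiness
        return 200 if state.alive and state.ready else 503
    return 404


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(state, self.path))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of the logs

    return Handler


def start_probe_server(state, port=8080):
    """Serve the probe endpoints on a background thread."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A pod that is still warming up answers 200 on `/healthz` and 503 on `/readyz`, so it is left running but receives no traffic until `state.ready` is set.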

What happens when you run `kubectl apply`?

kubectl sends the manifest to the API server, which validates and persists it to etcd. Controllers (e.g. Deployment controller) reconcile actual state toward desired state — creating ReplicaSets and Pods. The scheduler binds pods to nodes; kubelet pulls images and starts containers.
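
The reconcile step is the core idea worth being able to sketch on a whiteboard. This is a deliberately simplified model, not Kubernetes controller code: desired and actual state are plain dicts of replica counts, and the action tuples are invented for illustration.

```python
def reconcile(desired, actual):
    """One pass of a control loop: diff desired vs. actual replica counts.

    Both arguments map an app name to a replica count. Returns a list of
    (action, name, count) tuples that would move actual toward desired.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(("create", name, want - have))
        elif have > want:
            actions.append(("delete", name, have - want))
    # Anything running that is no longer declared gets cleaned up.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("delete", name, actual[name]))
    return actions
```

When actual already matches desired, the function returns an empty list, which is why re-applying an unchanged manifest does nothing.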

CI/CD

4

What's the difference between continuous delivery and continuous deployment?

Continuous delivery means every change that passes CI is releasable (deploy is a click). Continuous deployment removes the click — every passing change goes to production automatically.

How would you secure secrets in a CI pipeline?

Use the platform's encrypted secret store (GitHub Actions secrets, GitLab CI variables marked masked+protected), inject them only into jobs that need them, never echo them, scope them to environments, and rotate them. For higher trust: OIDC federation to cloud IAM instead of long-lived keys.
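
Inside a job, the same discipline applies to your own scripts: read the secret from the environment, fail loudly if it is missing, and scrub it from anything you print. A small sketch follows; `require_secret`, `mask`, and the `DEPLOY_TOKEN` variable name are hypothetical, and CI platforms do their own log masking on top of this.

```python
import os


def require_secret(name, env=os.environ):
    """Read a secret injected by the CI platform; fail if it is absent or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def mask(text, secrets):
    """Replace any secret value in text before it reaches logs."""
    for secret in secrets:
        if secret:
            text = text.replace(secret, "***")
    return text
```

For example, a deploy script can call `token = require_secret("DEPLOY_TOKEN")` and pass every log line through `mask(line, [token])`.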

How do you keep a pipeline fast as the project grows?

Cache dependencies, parallelize jobs by test shard or package, use change detection (only run affected projects), pre-warm Docker layer cache, and use larger runners for the bottleneck step.

How do you roll back a bad deploy safely?

Keep the previous artifact addressable (image tag, commit SHA). For containers, redeploy the previous tag. For DB schema changes, use expand/contract migrations so old code still works against the new schema.
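
The expand phase is what makes rollback safe, and it can be demonstrated with an in-memory SQLite database. In this sketch the `users` table, its columns, and the function names are all invented for illustration: the migration only adds a nullable column and backfills it, so the old code's INSERT keeps working after the schema change.

```python
import sqlite3


def create_v1_schema(conn):
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
    )


def old_code_insert(conn, full_name):
    """The previous release: knows nothing about the new column."""
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (full_name,))


def expand(conn):
    """Expand step: additive, nullable column plus a backfill. Nothing is dropped."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute(
        "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
    )


def new_code_read(conn):
    """The new release: prefers the new column, tolerates rows written by old code."""
    rows = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users ORDER BY id"
    )
    return [row[0] for row in rows]
```

The contract step (dropping `full_name`) only ships once no deployed version reads or writes the old column, which is what keeps a rollback to the previous tag possible in the meantime.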

Terraform & IaC

4

What is Terraform state and why does it matter?

State maps Terraform resources to real-world IDs. Without it, Terraform can't tell what already exists, so it would try to recreate everything. Store it remotely (S3+DynamoDB, GCS, Terraform Cloud) with locking to prevent concurrent corruption.
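
A toy model shows why losing state is so disruptive. This is not Terraform's actual planning algorithm; it assumes state is just a mapping from resource address to real-world ID, and the addresses and IDs used with it are made up.

```python
def plan(config_addresses, state):
    """Compare declared resources against recorded state.

    config_addresses: set of resource addresses declared in code.
    state: dict mapping resource address -> real-world ID.
    """
    return {
        "create": sorted(a for a in config_addresses if a not in state),
        "destroy": sorted(a for a in state if a not in config_addresses),
        "unchanged": sorted(a for a in config_addresses if a in state),
    }
```

With an empty `state`, every declared resource lands in `create`, even if it already exists in the cloud account, which mirrors the "recreate everything" failure mode described above.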

How do you handle secrets in Terraform?

Don't put them in .tf files. Pull from a secrets manager at apply time (Vault data source, AWS SSM/Secrets Manager), or pass via env vars / TF_VAR_*. Treat state as sensitive — it can contain secrets verbatim.

What's the difference between `terraform plan` and `terraform apply`?

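Plan computes and shows what would change without changing anything. Apply executes those changes. In CI, run plan on PRs for review, then apply on merge to main, ideally applying a saved plan file (`terraform plan -out=tfplan`, then `terraform apply tfplan`) so that what runs is what was reviewed.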

When would you choose Pulumi or CDK over Terraform?

When the team strongly prefers a general-purpose language (TypeScript/Python/Go) over HCL, when you need rich abstractions/loops, or when you're already deep in a single cloud (CDK is AWS-native). Terraform wins for multi-cloud, mature provider ecosystem, and operational simplicity.

Monitoring & Reliability

3

What are SLIs, SLOs, and error budgets?

SLI is a measured signal (e.g. % of requests under 300ms). SLO is the target for that signal (e.g. 99.5% over 30 days). Error budget is the unreliability the SLO allows (100% minus the SLO, here 0.5%), and what matters day to day is how much of it is left. When the budget is spent, you freeze risky changes.
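
The arithmetic is worth having at your fingertips. Below is a minimal sketch for a request-based SLO; the function name and return shape are illustrative, and the request counts in the usage note are made-up numbers.

```python
def error_budget(slo, total_requests, bad_requests):
    """Return (allowed_bad, remaining, consumed_fraction) for a request-based SLO.

    slo is a fraction, e.g. 0.995 for 99.5%.
    """
    allowed_bad = (1.0 - slo) * total_requests
    remaining = allowed_bad - bad_requests
    consumed = bad_requests / allowed_bad if allowed_bad else float("inf")
    return allowed_bad, remaining, consumed
```

With a 99.5% SLO over 1,000,000 requests, about 5,000 requests may miss the target. After 1,250 bad requests, roughly a quarter of the budget is spent and about 3,750 remain.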

What's the difference between metrics, logs, and traces?

Metrics are cheap numeric time-series, great for alerting. Logs are discrete events with context, great for debugging. Traces follow a request across services, great for latency analysis. Modern observability uses all three together.

How would you debug a sudden spike in p99 latency?

Start with a dashboard: is it one service or all? Check correlated deploys, infra events, and saturation (CPU, GC, DB connections). Use traces to find the slow span; logs around that timestamp on that instance. Roll back if the spike correlates with a recent change.
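
The first question, one instance or all of them, comes down to computing p99 per group instead of globally. This sketch uses the nearest-rank percentile method; the `(instance, latency_ms)` sample shape and function names are assumptions, and monitoring systems typically estimate percentiles from histograms rather than raw samples.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]


def p99_by_instance(samples):
    """samples: iterable of (instance, latency_ms) pairs."""
    grouped = defaultdict(list)
    for instance, latency_ms in samples:
        grouped[instance].append(latency_ms)
    return {instance: percentile(v, 99) for instance, v in grouped.items()}
```

If one instance's p99 stands far above its siblings, start with that host's logs and saturation metrics. If every instance moved together, look at a shared dependency or a recent deploy.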

Behavioral & Culture

3

Tell me about a time you caused a production incident.

Use STAR: Situation, Task, Action, Result. Be honest about the mistake, show what you did to mitigate, and emphasize the postmortem learnings and the guardrail you added afterwards (test, alert, runbook, automation).

How do you balance shipping fast with reliability?

Frame it as alignment, not tension: small batches, feature flags, progressive rollouts, and error budgets give the team a shared language. When the budget is healthy, ship fast; when it's burning, slow down and pay back risk.

Why DevOps and not pure backend / SRE?

Show a genuine reason — you love the systems thinking, the leverage of automation, the cross-team collaboration — backed by something concrete you've built (a pipeline, an IaC repo, a monitoring stack).