DevOps interviews test a wide surface area: CI/CD pipelines, infrastructure as code, containerization, monitoring, cloud services, scripting, and incident response. The depth depends on seniority — juniors get more "how does X work" questions, seniors get "design a system for Y" scenarios.
CI/CD & Automation
1. "Describe your ideal CI/CD pipeline."
Shows your architecture thinking and experience.
Answer: "Code push triggers the pipeline: lint and static analysis → unit tests → build artifact → integration tests → deploy to staging → automated smoke tests → manual QA gate (optional) → deploy to production with canary or blue-green → post-deploy monitoring and automatic rollback if error rate spikes. The goal is fast feedback — developers should know within 10 minutes if their code broke something."
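If the interviewer asks you to whiteboard this, the fail-fast ordering in the answer above can be sketched in a few lines. This is a minimal illustration, not a real CI system: the stage names mirror the answer, and the lambda functions are hypothetical stand-ins for whatever tools (linter, test runner, deploy script) a real pipeline would call.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    completed: List[str] = []
    for name, step in stages:
        if not step():
            # Later stages never run, so feedback arrives as early as possible.
            return False, completed
        completed.append(name)
    return True, completed


# Hypothetical stage functions; a real pipeline would shell out to tools here.
example_stages: List[Stage] = [
    ("lint", lambda: True),
    ("unit_tests", lambda: True),
    ("build", lambda: True),
    ("integration_tests", lambda: False),  # simulated failure
    ("deploy_staging", lambda: True),
]

ok, done = run_pipeline(example_stages)
print(ok, done)  # False ['lint', 'unit_tests', 'build']
```

The cheap checks sit first so that a broken change is rejected before the slower integration and deploy stages spend any time on it.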
2. "How do you handle a failed deployment in production?"
Incident response under pressure.
Answer: "First: rollback. If we're doing blue-green or canary, shift traffic back to the previous version. If not, deploy the last known good artifact. Second: investigate — check logs, metrics, and the diff of what changed. Third: fix and re-test before deploying again. Fourth: post-mortem to prevent recurrence. Speed of rollback matters more than speed of diagnosis."
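A follow-up you may get is "how would the rollback be triggered automatically?" One possible sketch, assuming you can read an error rate for both the previous version and the new one from your metrics system: the function name, the 2x ratio, and the 1% floor are illustrative assumptions, not recommended values.

```python
def should_rollback(
    baseline_error_rate: float,
    new_error_rate: float,
    max_ratio: float = 2.0,      # assumed: roll back at 2x the baseline
    min_absolute: float = 0.01,  # assumed: ignore error rates under 1%
) -> bool:
    """Decide whether the new version's error rate justifies a rollback.

    Rates are fractions (0.05 means 5% of requests failed).
    """
    # An absolute floor avoids rolling back on noise when both rates are tiny.
    if new_error_rate < min_absolute:
        return False
    # A previously error-free service that now fails above the floor is a spike.
    if baseline_error_rate == 0:
        return True
    return new_error_rate / baseline_error_rate >= max_ratio


print(should_rollback(0.005, 0.05))   # True: 10x the baseline and above the floor
print(should_rollback(0.005, 0.006))  # False: below the absolute floor
```

Combining a relative check with an absolute floor is one way to keep the trigger from firing on small fluctuations while still reacting quickly to a real spike.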
3. "What's the difference between continuous integration, continuous delivery, and continuous deployment?"
Fundamentals.
Answer: CI = developers merge into the shared branch frequently, and every push is built and tested automatically. Continuous delivery = every change that passes the pipeline is kept in a deployable state, but releasing to production is a manual decision. Continuous deployment = every change that passes the pipeline is deployed to production automatically, with no manual approval step. Many teams do CI + continuous delivery. True continuous deployment requires very high test coverage and confidence in the pipeline.
Infrastructure
4. "Explain Infrastructure as Code. What tools have you used?"
Answer: "IaC means managing infrastructure through code files rather than manual configuration — version controlled, repeatable, auditable. I've used Terraform for cloud provisioning, Ansible for configuration management, and CloudFormation for AWS-specific resources. The key benefit is reproducibility — I can spin up an identical environment in minutes."
5. "How do you manage secrets and sensitive configuration?"
Security question.
Answer: "Never in code, and never in plaintext config or .env files on disk. I use dedicated secret managers: HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager. Secrets are injected at runtime, rotated regularly, and access is logged and audited. In CI/CD, I use the pipeline's native secret management (GitHub Actions secrets, GitLab CI/CD variables) with masking enabled so values never appear in job logs."
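To make "injected at runtime" and "masking" concrete, here is a small sketch from the application's side. It assumes the secret manager or CI system has already placed the value in the process environment; `load_secret`, `mask`, and `MissingSecretError` are hypothetical helper names for this example, not part of any secret manager's API.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when an expected secret was not injected into the environment."""


def load_secret(name: str) -> str:
    """Read a secret injected at runtime; fail fast instead of using a default."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} was not injected")
    return value


def mask(value: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret, showing only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Usage (assumes the platform injected a variable with this hypothetical name):
#   password = load_secret("DB_PASSWORD")
#   print("using password", mask(password))
```

Failing fast on a missing secret is a deliberate choice here: a hard-coded fallback value would quietly reintroduce the "secret in code" problem the answer warns against.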
6. "Describe your experience with containers and orchestration."
Cover: Docker for containerization, Kubernetes (or ECS/Fargate) for orchestration. Talk about: writing Dockerfiles, multi-stage builds, pod design, scaling, health checks, resource limits, and persistent storage. If you've managed production K8s clusters, describe the scale.
7. "How do you approach cloud cost optimization?"
Increasingly important in DevOps.
Answer: "I use reserved instances for predictable workloads, spot instances for batch processing, right-size instances based on actual usage (not estimates), auto-scale to match demand, set up billing alerts, and run regular cost audits. I've also moved infrequently accessed data to cheaper storage tiers and shut down non-production environments outside business hours."
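It helps to have a number ready for the last point in that answer. The arithmetic below estimates how much always-on compute time a shutdown schedule removes. The 10-hours-a-day, 5-days-a-week schedule is an assumed example, and the result only applies to resources billed by running time; storage and other fixed charges continue.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def scheduled_savings_fraction(hours_on_per_week: float) -> float:
    """Fraction of always-on compute hours avoided by a shutdown schedule."""
    if not 0 <= hours_on_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_on_per_week must be between 0 and 168")
    return 1 - hours_on_per_week / HOURS_PER_WEEK


# Assumed schedule: environments run 10 hours a day, 5 days a week.
fraction = scheduled_savings_fraction(10 * 5)
print(f"{fraction:.1%}")  # 70.2%
```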
Monitoring & Reliability
8. "How do you set up monitoring for a production system?"
Answer: "I monitor four layers: infrastructure (CPU, memory, disk, network), application (error rates, latency, throughput), business metrics (orders, sign-ups, revenue), and user experience (page load time, Core Web Vitals). I use Prometheus + Grafana for metrics, ELK or Datadog for logs, PagerDuty for alerting. Alerts should be actionable — if it pages you at 3 AM, there should be a runbook for what to do."
9. "What's the difference between monitoring, observability, and alerting?"
Answer: "Monitoring is tracking known metrics against thresholds. Alerting is notifying when those thresholds are breached. Observability is the ability to understand internal system state from external outputs — so you can debug unknown problems, not just detect known ones. Observability requires structured logging, distributed tracing, and rich metrics."
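If asked what "structured logging" looks like in practice, a short example is more convincing than a definition. The sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The `request_id` and `user_id` fields and the `checkout` logger name are hypothetical, and a production formatter would normally add a timestamp as well.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical context fields, passed by the caller through `extra=`.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"request_id": "req-123"})
# {"level": "INFO", "logger": "checkout", "message": "payment failed", "request_id": "req-123"}
```

Because every line is machine-parseable and carries a request identifier, a log platform can filter and join on those fields, which is what lets you investigate problems you did not anticipate when writing the alert rules.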
10. "How do you define and measure SLOs (Service Level Objectives)?"
SRE-focused question.
Answer: "I define SLOs based on user-facing outcomes: availability (99.9% uptime), latency (p95 under 200ms), error rate (less than 0.1%). I measure using real traffic data, not synthetic tests. I set error budgets — if we burn through our error budget, we slow down feature releases and focus on reliability. SLOs are a contract between engineering and the business."
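Interviewers often follow up with "so how much downtime does 99.9% actually allow?" The calculation is simple enough to do on the spot. The sketch below assumes a 30-day window and a request-based budget; the function names and the request counts are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO allows over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget is exhausted.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes")        # 43.2 minutes
print(f"{budget_remaining(0.999, 1_000_000, 400):.0%}")    # 60%
```

With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures, so 400 failures leave about 60% of the budget. Once that figure approaches zero, the answer above applies: slow feature releases and spend the time on reliability.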
Behavioral
11. "Tell me about a major outage you managed. What happened?"
Show systematic response, not heroics.
Structure: What triggered it → how you detected it → response timeline → resolution → root cause → what you changed to prevent recurrence.
12. "How do you balance new feature work with operational improvements?"
The DevOps version of tech debt.
Answer: "I advocate for a split — typically 70/30 or 80/20 features/reliability. I track toil (manual operational work) and invest in automation that pays for itself. When reliability suffers, I use SLO error budgets to justify slowing feature work."
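"Automation that pays for itself" lands better with a quick break-even estimate. The sketch below is one simple way to frame it; the hour figures are made-up inputs for illustration, and the function name is hypothetical.

```python
def payback_weeks(
    build_hours: float,
    toil_hours_per_week: float,
    upkeep_hours_per_week: float = 0.0,
) -> float:
    """Weeks until time saved by automation equals the time spent building it."""
    net_saved_per_week = toil_hours_per_week - upkeep_hours_per_week
    if net_saved_per_week <= 0:
        # Automation that costs as much to maintain as it saves never pays back.
        return float("inf")
    return build_hours / net_saved_per_week


# Assumed inputs: 40 hours to build, removes 5 hours of weekly toil,
# needs 1 hour of weekly maintenance.
print(payback_weeks(40, 5, 1))  # 10.0
```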
13. "How do you approach security in your pipelines and infrastructure?"
Shift-left security.
Answer: "I integrate security into the pipeline: dependency scanning (Snyk, Dependabot), container image scanning, SAST/DAST in CI, least-privilege IAM policies, encrypted storage and transit, and regular audits. Security isn't a gate at the end — it's embedded throughout."
14. "What scripting languages do you use?"
Be specific: Bash for quick automation, Python for complex scripts and tooling, Go for CLI tools and performance-sensitive work. Also mention the configuration languages you work with, such as Terraform's HCL and YAML for K8s manifests.
15. "What questions do you have for us?"
Ask about: current infrastructure stack, deployment frequency, on-call rotation and incident response process, biggest reliability challenge, and the team's approach to toil reduction.
Want questions tailored to your exact role? Paste the job description at PasteJob and get a personalized cheat sheet in 15 seconds.