How a 6-Person Startup Runs SRE for 50 Services Using AI Copilots in 2026

The exact tool stack, on-call rotation, and incident playbook that let us cut MTTR by 64% without hiring.

Nilesh Kasar
May 9, 2026
7 min read

We do not have an SRE. We have Claude.

When my co-founder and I started the company in 2023, every senior advisor told us the same thing: at 30 services, you need a dedicated SRE. We are now at 50 services, six engineers total, no SRE, and our P1 incident MTTR last quarter was 14 minutes. Two years ago it was 39.

The honest answer for how we got here is not heroic discipline. It is that AI-powered DevOps tooling matured faster than our service count grew. This is the operational stack that actually runs us.

The shape of the team

Six engineers. No formal SRE, no DevOps lead. Everyone takes a turn as primary on-call, and each engineer's week comes around roughly once every six weeks. We deploy on average 24 times a day across the fleet. Last quarter we shipped 1,847 deploys with three rollbacks. The on-call burden, measured by pages outside business hours, is 1.3 per rotation. It used to be 4.


The stack

Here is what we run, end to end:

| Layer | Tool | Why it stays |
|---|---|---|
| Cloud | AWS, mostly Fargate + RDS | Boring, well-understood |
| IaC | Terraform 1.9 + Atlantis | Reviewable plans, no surprise drift |
| CI/CD | GitHub Actions + Buildkite for hot paths | Speed and isolation |
| Observability | Datadog (metrics, traces, logs) | One pane, AI features earn the price |
| Incident | PagerDuty + Rootly | Rootly's AI summarizer cut writeups 80% |
| AI ops copilot | Datadog Bits AI + Claude Opus 4.7 via internal MCP | The actual force multiplier |
| Code review | Greptile + GitHub Copilot Workspace | AI review before human review |
| On-call assist | Custom slackbot on Claude 4.7 | Reads runbooks, drafts comms |

Datadog is the boring backbone. The interesting part is the AI layer on top.

What the AI tooling actually does

Three concrete examples from this quarter, each of which would have been impossible or unaffordable two years ago:

1. Incident triage in 90 seconds

When PagerDuty fires, our slackbot does this automatically before a human looks:

  1. Pulls the last 30 minutes of correlated logs and traces from Datadog via API.
  2. Pulls the last 24 hours of deploys for the affected service from GitHub.
  3. Asks Claude Opus 4.7, with the runbook for that service in context, to draft a hypothesis and a first action.
  4. Posts the hypothesis, the suspected commit, and a one-click "rollback" button into the incident channel.
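For the curious, here is a minimal sketch of that flow in Python. The fetch and load helpers are hypothetical stand-ins for the Datadog, GitHub, and filesystem calls; only the Anthropic client usage reflects the real library, and the model id is illustrative.

```python
"""Minimal triage-bot sketch. The fetch_*/load_* helpers are
hypothetical stand-ins for our Datadog and GitHub API calls."""
import anthropic


def fetch_telemetry(service: str) -> str:
    """Stub: last 30 min of correlated logs + traces from Datadog."""
    raise NotImplementedError


def fetch_deploys(service: str) -> str:
    """Stub: last 24 h of deploys for the service from GitHub."""
    raise NotImplementedError


def load_runbook(service: str) -> str:
    """Stub: read /runbooks/<service>.md from the monorepo."""
    raise NotImplementedError


def draft_hypothesis(service: str) -> str:
    """Ask the model for a hypothesis, suspected commit, and first action."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
    prompt = (
        f"Runbook:\n{load_runbook(service)}\n\n"
        f"Telemetry (last 30 min):\n{fetch_telemetry(service)}\n\n"
        f"Deploys (last 24 h):\n{fetch_deploys(service)}\n\n"
        "Give: (1) a root-cause hypothesis, (2) the suspected commit, "
        "(3) the single first action from the runbook."
    )
    msg = client.messages.create(
        model="claude-opus-4-7",  # illustrative model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # The bot posts this, plus a one-click rollback button, to Slack.
    return msg.content[0].text
```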

Roughly 70% of the time, the hypothesis is correct or directionally right. Even when it is wrong, it gives the on-call a starting frame. MTTR on simple regressions dropped from a median of 22 minutes to 6.

2. PR review before humans

Greptile reviews every PR for cross-service impact, missing tests, and pattern violations. It catches roughly 40% of issues that human reviewers used to catch. Humans still review every PR, but they review faster because the AI has already flagged the obvious stuff.

We measured this: median time-from-PR-open to merge fell from 6.4 hours to 2.7. Reviewer fatigue, measured by self-reported survey, dropped meaningfully.

3. Postmortem drafts that humans actually finish

Rootly's AI writes the first draft of every postmortem from the incident timeline, Slack transcript, and linked telemetry. It is 60% there. The on-call edits and ships it. Postmortem completion rate went from 38% (humans hate writing them) to 94%. The compounding effect on institutional learning is the most underrated win.

What it costs

Real numbers from last month:

  • Datadog: $9,400 (the largest line item by far)
  • PagerDuty + Rootly: $1,100
  • Greptile + Copilot Workspace: $640
  • LLM API costs (Claude + a little OpenAI): $480
  • Total operational tooling: about $11,600/month

That looks expensive until you compare it to a single SRE hire at fully loaded $230K/year, or roughly $19K/month. The tooling is cheaper than half a person, and it does not sleep, take vacation, or quit.

What does not work

Four things we tried and abandoned, and that I still see other small teams try:

  1. Fully autonomous remediation. Letting an agent roll back or scale infra without a human in the loop. We tried it for one quarter on a non-critical service. It made one bad call that cascaded into a 45-minute outage. Now everything is human-confirmed with one-click actions, not auto-actions.
  2. Replacing dashboards with chat. Asking the AI ops copilot for status instead of looking at dashboards. Sounds great. In practice, dashboards are faster and the chat interface adds latency to a task humans are already efficient at. We use chat for novel investigation, dashboards for routine.
  3. AI-generated runbooks. We tried generating runbooks from code and incident history. The output was plausible but subtly wrong in ways that would mislead a tired on-call at 3 AM. Runbooks remain human-written. The AI reads them, it does not write them.
  4. All-in-one "AIOps" platforms that promise everything. We piloted two of them in 2025. Both were impressive in demos and disappointing in practice — generic correlation engines that surfaced obvious things and missed the subtle ones. The pattern that won was the opposite: a great primary observability tool plus narrow AI helpers wired into the workflow we already had. Composability beat platform.

The runbook format that actually plays well with AI

Once we realized AI tools were the primary readers of our runbooks, we changed the format. Old runbooks were prose. New runbooks are structured Markdown with a strict header convention: a one-line summary, a "Symptoms" section with bullet points, a "Likely causes" ranked list, a "Remediation steps" numbered list with explicit commands, and a "When to escalate" line. The AI triage bot can extract any of these sections deterministically. The on-call human can also scan them faster at 2 AM. Writing for the AI made them better for humans too.
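In concrete terms, a runbook skeleton looks like this. The service name and commands here are illustrative, not pulled from our repo:

```markdown
# payments-api: elevated 5xx on checkout

Checkout requests fail with 502s at the load balancer.

## Symptoms
- `payments.checkout.5xx` above 1% for 5 minutes
- p99 latency roughly flat (rules out saturation)

## Likely causes
1. Bad deploy of payments-api in the last hour
2. RDS connection pool exhaustion
3. Upstream auth-service degradation

## Remediation steps
1. Check recent deploys: `gh run list --repo acme/payments-api --limit 5`
2. If a deploy is suspect: `./scripts/rollback.sh payments-api`
3. If pool exhaustion: raise `DB_POOL_SIZE` and redeploy

## When to escalate
Page secondary if 5xx persists 15 minutes after rollback.
```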

We keep runbooks in the same monorepo as the service, in a /runbooks folder next to the code, with CODEOWNERS pointing to the service team. That last detail matters more than it sounds: runbook drift is the single biggest enemy of AI-augmented on-call. When the runbook lives next to the code and the team that owns the code owns the runbook, drift slows down dramatically.
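In CODEOWNERS terms, with paths and team handles illustrative:

```
# Service code and its runbook share an owner, so drift surfaces in review
/services/payments-api/    @acme/payments-team
/runbooks/payments-api.md  @acme/payments-team
```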

How we structure the on-call

The shape that emerged after a year of iteration:

  • Primary on-call: full rotation, one week.
  • Secondary: shadows the primary and is the explicit escalation point.
  • AI assist: not a tier, but a tool the primary uses for triage and comms.
  • Severity gating: only P1 and P2 page out of hours. P3+ wait for business hours, full stop.
  • Runbook discipline: every alert links to a runbook. No runbook, no alert. Enforced in code review.

That last rule is the most boring and most important one. AI tools amplify good runbooks. They do not save you from missing ones.
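Enforcement can be a small CI script. A sketch, assuming monitors are defined as JSON files with a runbook_url field pointing at a repo-relative path; the layout here is illustrative, not our exact setup:

```python
"""CI gate: every monitor definition must link to a runbook that
actually exists in the repo. File layout here is illustrative."""
import json
import pathlib
import sys

REPO = pathlib.Path(__file__).resolve().parent.parent
missing = []

for monitor_file in (REPO / "monitors").glob("**/*.json"):
    monitor = json.loads(monitor_file.read_text())
    url = monitor.get("runbook_url", "")
    # Runbooks are linked as repo-relative paths, e.g. /runbooks/foo.md
    if not url or not (REPO / url.lstrip("/")).is_file():
        missing.append(str(monitor_file))

if missing:
    print("No runbook, no alert. Offending monitors:", *missing, sep="\n  ")
    sys.exit(1)
```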

A word on alert fatigue

AI triage does not fix bad alerts. It makes their cost slightly less painful, which is dangerous because it removes the pressure to fix them. We do a quarterly "alert audit" where every alert that fired in the last 90 days is reviewed for signal-to-noise, and anything below a threshold is killed or rewritten. Our total alert count dropped from 312 to 117 over the past year. Pages dropped roughly proportionally. The AI tools had nothing to do with that win — it was old-fashioned discipline. But without that discipline, the AI layer would have been triaging mostly garbage, and humans would have stopped trusting it.
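The audit itself is mostly discipline, but the ranking step is scriptable. A sketch, assuming a CSV export of 90 days of firings with alert_name and actionable columns; the schema is hypothetical:

```python
"""Rank alerts by noise for the quarterly audit. Assumes a CSV export
with columns alert_name,actionable (1 if a human took action)."""
import csv
from collections import Counter

fired, acted = Counter(), Counter()
with open("alert_firings_90d.csv") as f:
    for row in csv.DictReader(f):
        fired[row["alert_name"]] += 1
        acted[row["alert_name"]] += int(row["actionable"])

# Flag anything where fewer than 1 in 5 firings was actionable.
for name, count in fired.most_common():
    ratio = acted[name] / count
    if ratio < 0.2:
        print(f"REVIEW {name}: fired {count}x, actionable {ratio:.0%}")
```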

What this means for you

If you are a small team running more services than your headcount should allow, the playbook in mid-2026 is:

  • Pay for one excellent observability platform. Do not stitch open-source tools together to save money. The hours lost are not worth it.
  • Add an AI triage layer on top. Even a homemade slackbot calling Claude or GPT-5.1 with your runbooks in context will earn its keep within a month.
  • Use AI for first-draft incident writeups, postmortems, and PR reviews. Keep humans on the final call.
  • Do not let any AI tool execute production changes without a human click.
  • Enforce runbook-per-alert. Without that, AI ops tools are guessing.

The phrase I keep coming back to is "AI as a senior engineer's assistant." It is not replacing senior judgment. It is removing the toil between senior judgments. For a small team, that compounds into something that looks, from the outside, like a much larger one.

If I were starting from scratch today with a six-person team and 50 services to operate, I would invest in three things in this order: ruthless alert hygiene, structured runbooks per alert, and a homemade triage slackbot wired to a frontier LLM with those runbooks in context. Everything else — the commercial AI ops tools, the postmortem generators, the AI code review — is genuinely useful but optional. Those three foundations are not. Get them right, and the rest compounds.
