Observability & On-call

The observability and on-call infrastructure that turns production from a source of anxiety into a system you can actually operate.

Category: Infrastructure

Starts with: A scoping call

Status: Booking 2026

(01) Our take

Observability as a practice has drifted toward maximalism over the last few years. Every service instrumented with every possible trace. Every alert tuned to fire on every possible signal. Dashboards with three hundred widgets that nobody reads because nobody can. The result is paradoxical: teams with more observability coverage than ever, and a worse ability to find the thing that’s actually wrong when production breaks.

Our observability work is about the opposite: deliberate coverage, tuned alerts, and dashboards structured around the questions operators actually ask. The default instrumentation we ship is OpenTelemetry for traces and logs — vendor-neutral by design, so the client isn’t locked into a specific backend. The backend is a client choice: Sentry for error tracking (which we consider non-negotiable, not optional), Datadog for the teams that can afford it, Grafana Cloud or a self-hosted Grafana stack for teams that need cost control or data-residency compliance.
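
As a concrete sketch, wiring a Python service to emit traces over OTLP looks roughly like the following. The service name, collector endpoint, and span attribute are illustrative assumptions, not a prescribed setup; the OTLP exporter is what keeps the backend swappable.

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Identify the service; every backend groups telemetry by service.name.
    provider = TracerProvider(
        resource=Resource.create({"service.name": "checkout"})  # placeholder name
    )
    # Export over vendor-neutral OTLP; the endpoint is a placeholder for
    # whichever collector or backend the client chooses.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("order.id", "ord_1234")  # illustrative attribute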

Alerting is where most observability setups go wrong. The failure mode is the same everywhere: too many alerts, tuned too sensitively, that fire at 2am for things that don’t matter. The team stops reading alerts. The one alert that matters gets lost in the noise. We tune alerts around a simple rule: if the alert fires, a human should take action, and the action should be worth being woken up for. Alerts that don’t meet that bar get demoted to dashboards or removed entirely. We’d rather ship ten alerts you trust than a hundred you tune out.
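
One way to make that bar concrete is the multi-window burn-rate pattern from the SRE literature: page only when the error budget is burning fast over both a short and a longer window. The sketch below is illustrative (it assumes a 30-day budget and a 99.9% SLO), not a shipped default.

    def should_page(err_rate_5m: float, err_rate_1h: float,
                    slo_target: float = 0.999) -> bool:
        """Page only if both windows are burning: the 1h window confirms
        the burn is sustained (a brief blip won't trip it), the 5m window
        confirms it is still happening (a resolved incident stops paging)."""
        budget = 1.0 - slo_target
        # Burn rate 14.4 ~= 2% of a 30-day error budget consumed in one hour.
        threshold = 14.4 * budget
        return err_rate_5m >= threshold and err_rate_1h >= threshold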

On-call is the human half of the work. We help clients set up rotations that don’t punish the same engineer every weekend, runbooks that actually get read, incident-response templates that don’t require drafting incident comms from scratch at 3am, and post-incident reviews that produce systemic fixes rather than blame. For smaller teams, we’ll recommend against a formal on-call rotation and propose a best-effort model with clear escalation — honesty about what the team can sustain beats a rotation that burns people out.
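
As a toy illustration of the weekend-fairness point (team names are placeholders), even a plain weekly round-robin spreads weekend duty across the whole team instead of pinning it to whoever was unlucky when the rotation was set up:

    TEAM = ["aki", "bea", "chidi", "dana"]  # placeholder names

    def primary_for_week(week: int) -> str:
        """Weekly round-robin: each engineer covers every len(TEAM)-th
        week, so weekend duty cycles through the team."""
        return TEAM[week % len(TEAM)]

    for week in range(8):
        print(f"week {week}: primary on-call = {primary_for_week(week)}")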

(02) What we build

Typical work

  • OpenTelemetry instrumentation for services, with the backend of the client’s choice
  • Sentry setup with tuned alerts and meaningful issue grouping (see the sketch after this list)
  • Datadog or Grafana dashboards structured around real operator questions
  • On-call rotations with escalation paths and genuine runbooks
  • Incident-response processes and post-incident review templates
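
For the Sentry item above, a baseline init looks like the sketch below; the DSN, release string, and sample rate are placeholders that a real engagement would tune.

    import sentry_sdk

    sentry_sdk.init(
        dsn="https://examplekey@o0.ingest.sentry.io/0",  # placeholder DSN
        environment="production",
        release="checkout@1.4.2",   # illustrative; lets Sentry flag regressions
        traces_sample_rate=0.1,     # sample transactions rather than tracing all
        send_default_pii=False,     # keep user data out of events by default
    )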

(03) Is this for you

When to pick this

  • You’ve had an outage and realised you couldn’t tell what was actually wrong.
  • Your on-call rotation exists on paper but the team routes around it.
  • Your alerts have more noise than signal and the team has stopped reading them.
  • You’re scaling and the observability that worked for three services doesn’t work for fifteen.

When not to pick this

  • You haven’t shipped anything to production yet. Add observability when there’s something to observe.
  • You want a specific vendor and you just need implementation help. That’s shorter, more focused work than a full engagement.

(04) Engagement shape

How we engage

4–8 week engagements for focused observability and on-call work. Often paired with Cloud Architecture or CI/CD engagements.

(05) What you walk away with

Deliverable

The headline artefact

Instrumentation, dashboards, alerts, and an on-call process that your team can operate without losing sleep — plus runbooks that explain why each piece exists.

Signature tools we reach for

Sentry · Datadog · OpenTelemetry · Grafana

Start an Observability & On-call engagement.