Infrastructure · Observability platform

Telemetry that answers questions before the page goes out

We built a real-time platform that ingests millions of events a second and makes them queryable by the on-call engineer who needs answers. The hard part wasn't storage; it was making the data fresh and fast enough to reason with during an incident.

Streaming ingestion · Time-series · High-cardinality query · Alerting
one trace id: commit · deploy · alert
The challenge

The client's systems emitted plenty of metrics, logs, and traces, but the data landed in three places that never agreed with each other. During an outage, engineers stitched timelines together by hand while the clock ran. They needed one view, fresh to the second, that could answer 'what changed' without a twenty-minute query.

What we built
  • A streaming ingestion pipeline that normalises metrics, logs, and traces into one event model as they arrive, with backpressure so a spike never drops data on the floor.
  • A columnar store tuned for high-cardinality queries, so engineers can group by request, tenant, or deploy without the query timing out.
  • Correlation that links a trace to the deploy, the config change, and the alert that fired, so the causal chain is one click, not a scavenger hunt.
  • Alerting built on the same query path as exploration, so a rule and an investigation never disagree.
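To make the first bullet concrete, here is a minimal sketch of what "one event model" can look like. It is an illustration under assumptions, not the production code: the `Event` fields, the `normalise` function, and the raw record keys (`ts_ms`, `start_ms`, `msg`, and so on) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical shared shape for metrics, logs, and traces."""
    kind: str                      # "metric", "log", or "trace"
    timestamp_ms: int              # event time in milliseconds
    service: str                   # emitting service
    trace_id: Optional[str]        # present when the record belongs to a trace
    attributes: Dict[str, Any] = field(default_factory=dict)


def normalise(kind: str, raw: Dict[str, Any]) -> Event:
    """Map one raw record onto the shared Event shape."""
    if kind == "metric":
        timestamp = raw["ts_ms"]
        attributes = {"name": raw["name"], "value": raw["value"]}
    elif kind == "log":
        timestamp = raw["ts_ms"]
        attributes = {"level": raw["level"], "message": raw["msg"]}
    elif kind == "trace":
        timestamp = raw["start_ms"]
        attributes = {"span": raw["span"], "duration_ms": raw["duration_ms"]}
    else:
        raise ValueError(f"unknown record kind: {kind!r}")
    return Event(
        kind=kind,
        timestamp_ms=timestamp,
        service=raw["service"],
        trace_id=raw.get("trace_id"),  # None when the source has no trace id
        attributes=attributes,
    )
```

Because every record carries the same `timestamp_ms`, `service`, and `trace_id` fields after this step, a single query can line up a log line, a metric sample, and a span on one timeline.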
The outcome
  • Time to first useful answer during an incident dropped from minutes to seconds.
  • One source of truth replaced three dashboards that used to contradict each other.
  • On-call engineers investigate from the data instead of guessing from a graph.
FAQ

Common questions

Storage is the easy part. The hard part is making data fresh and fast enough to reason with during an incident, while ingesting millions of events a second. We tune ingestion and the query path together so an on-call engineer gets a useful answer in seconds.

We use a columnar store tuned for it, so engineers can group by request, tenant, or deploy without the query falling over. Cardinality that breaks general-purpose dashboards is the normal case here, not an edge case the system quietly refuses to handle.
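A toy sketch of why a columnar layout helps with group-by queries: each field lives in its own array, so an aggregation reads only the two columns it needs. The `ColumnStore` class and its methods are hypothetical and exist only for this illustration; a real columnar store adds compression, indexing, and on-disk layout that are not shown here.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ColumnStore:
    """Toy in-memory column store: one list per field."""

    def __init__(self, columns: List[str]) -> None:
        self.columns: Dict[str, List[Any]] = {name: [] for name in columns}

    def append(self, row: Dict[str, Any]) -> None:
        # Split the row across columns; missing fields are stored as None.
        for name, values in self.columns.items():
            values.append(row.get(name))

    def sum_by(self, group_col: str, value_col: str) -> Dict[Any, float]:
        # Only the grouping column and the value column are scanned.
        totals: Dict[Any, float] = defaultdict(float)
        for key, value in zip(self.columns[group_col], self.columns[value_col]):
            totals[key] += value
        return dict(totals)


store = ColumnStore(["tenant", "deploy", "latency_ms"])
store.append({"tenant": "acme", "deploy": "d1", "latency_ms": 120.0})
store.append({"tenant": "acme", "deploy": "d2", "latency_ms": 80.0})
store.append({"tenant": "globex", "deploy": "d2", "latency_ms": 50.0})
latency_by_tenant = store.sum_by("tenant", "latency_ms")
```

Grouping by `deploy` instead of `tenant` is the same call with a different column name, and the `tenant` column is never touched.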

No. The streaming pipeline applies backpressure, so a spike slows ingestion gracefully instead of dropping events on the floor. Metrics, logs, and traces are normalised into one event model as they arrive, so the data stays consistent even under load.
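Backpressure can be sketched with a bounded buffer: when the buffer is full, the producer waits instead of discarding events. This is a simplified single-process illustration using Python's standard `queue.Queue`; the `run_pipeline` function and the `capacity` parameter are hypothetical names, and a real pipeline would apply the same idea across network hops.

```python
import queue
import threading
from typing import Any, Iterable, List


def run_pipeline(events: Iterable[Any], capacity: int = 4) -> List[Any]:
    """Push events through a bounded buffer and return what was stored."""
    buffer: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)
    stored: List[Any] = []
    done = object()  # sentinel that tells the consumer to stop

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is done:
                break
            stored.append(item)  # stand-in for writing to the store

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        # put() blocks while the buffer is full, so a burst slows the
        # producer down rather than dropping events.
        buffer.put(event)
    buffer.put(done)
    worker.join()
    return stored
```

With a capacity of 4 and a burst of 1,000 events, the producer is paused whenever the consumer falls behind, and every event still arrives in order.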

Alerting is built on the same query path as exploration. A rule and an ad-hoc investigation run against identical data, so they can't quietly disagree. Correlation links a trace to the deploy, config change, and alert that fired, so cause is one click rather than a hunt.
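One way to picture "the same query path" is a single query function that both exploration and alert evaluation call. The sketch below is illustrative only: `count_by`, `explore`, and `firing` are hypothetical names, and events are plain dictionaries for brevity.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple

Event = Dict[str, Any]  # hypothetical event shape: field name -> value


def count_by(events: Iterable[Event], field: str,
             where: Callable[[Event], bool]) -> Counter:
    """The single query path: filter events, then count them per group."""
    return Counter(e[field] for e in events if where(e))


def explore(events: Iterable[Event], field: str,
            where: Callable[[Event], bool]) -> List[Tuple[Any, int]]:
    """Ad-hoc investigation: every group, largest count first."""
    return count_by(events, field, where).most_common()


def firing(events: Iterable[Event], field: str,
           where: Callable[[Event], bool], threshold: int) -> List[Any]:
    """Alert rule: groups whose count reaches the threshold."""
    counts = count_by(events, field, where)
    return sorted(group for group, n in counts.items() if n >= threshold)


sample = [
    {"tenant": "a", "status": 500},
    {"tenant": "a", "status": 500},
    {"tenant": "b", "status": 500},
    {"tenant": "b", "status": 200},
]


def is_error(event: Event) -> bool:
    return event["status"] >= 500
```

Since `explore` and `firing` both delegate to `count_by`, any group that trips the rule shows the same count when an engineer runs the investigation by hand.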

Have a problem shaped like this?

If this looks like the kind of system you need, let's talk through it. First call is always free.

Start a project