November 2, 2023 | observability, microservices

Correlation IDs and trace context across services

Propagating request identifiers through gateways, workers, and async jobs so incidents get a single timeline instead of log archaeology.

When a user reports an intermittent failure, answers rarely live in one log file. Correlation identifiers—and, where adopted, W3C Trace Context—tie the browser, edge, API tier, and asynchronous workers into one investigation thread.

Normalize inbound headers

Accept upstream IDs when present; mint one when missing so every request is addressable in logs.

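import { randomUUID } from "node:crypto";

// W3C traceparent is "version-traceid-parentid-flags"; the trace-id is 32 lowercase hex chars.
const TRACE_ID = /^[0-9a-f]{32}$/;

export function correlationIdFrom(req: {
  headers: { get(name: string): string | null };
}): string {
  // Prefer an explicit upstream request ID.
  const requestId = req.headers.get("x-request-id")?.trim();
  if (requestId) return requestId;

  // Otherwise reuse the full trace-id (not a truncated prefix) so logs line up with traces.
  const traceId = req.headers.get("traceparent")?.trim().split("-")[1];
  if (traceId && TRACE_ID.test(traceId) && !/^0+$/.test(traceId)) return traceId;

  // Nothing usable upstream: mint a fresh ID.
  return randomUUID();
}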

Propagate on every outbound hop

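Attach the identifier to every downstream request so the next service can log it and pass it along in turn.

// correlationId is the value returned by correlationIdFrom(req) for the inbound request.
await fetch("https://inventory.internal/v1/reserve", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-Request-Id": correlationId,
  },
  body: JSON.stringify(payload),
});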
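
Passing the ID by hand to every call site tends to get forgotten. One possible approach, assuming a Node.js runtime, is to keep the ID in an AsyncLocalStorage and read it when building outbound headers. The names runWithCorrelationId, outboundHeaders, and tracedFetch below are hypothetical helpers for illustration, not part of any library.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Holds the current request's correlation ID for the lifetime of its async call chain.
const correlationStore = new AsyncLocalStorage<string>();

// Hypothetical helper: run a request handler with the ID bound to its async context.
export function runWithCorrelationId<T>(id: string, fn: () => T): T {
  return correlationStore.run(id, fn);
}

// Hypothetical helper: merge the current ID (if any) into a set of outbound headers.
export function outboundHeaders(
  extra: Record<string, string> = {},
): Record<string, string> {
  const id = correlationStore.getStore();
  return id ? { ...extra, "X-Request-Id": id } : { ...extra };
}

// Hypothetical wrapper: a fetch that stamps the ID on every outbound call.
export function tracedFetch(url: string, init: RequestInit = {}): Promise<Response> {
  const headers = new Headers(init.headers);
  for (const [name, value] of Object.entries(outboundHeaders())) {
    headers.set(name, value);
  }
  return fetch(url, { ...init, headers });
}
```

A server entry point would call runWithCorrelationId once per inbound request, after which any tracedFetch made during that request carries the header without the call site knowing about it.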

Async handoffs

Queue messages should carry the same identifier in an envelope field so consumers attach it to logs and downstream calls—never bury it only inside an opaque payload blob.
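
A minimal sketch of such an envelope, assuming JSON-serialized messages. The Envelope shape and the wrap and unwrap helpers are hypothetical names chosen for illustration; a real broker client may offer message attributes or headers that serve the same purpose.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical envelope: the ID sits beside the payload, never inside it.
export interface Envelope<T> {
  correlationId: string;
  payload: T;
}

// Producer side: serialize the payload together with the caller's correlation ID.
export function wrap<T>(payload: T, correlationId: string): string {
  const envelope: Envelope<T> = { correlationId, payload };
  return JSON.stringify(envelope);
}

// Consumer side: recover the ID, minting one if the producer did not supply it,
// so the job is still addressable in logs.
export function unwrap<T>(raw: string): Envelope<T> {
  const parsed = JSON.parse(raw) as Partial<Envelope<T>>;
  const upstream =
    typeof parsed.correlationId === "string" ? parsed.correlationId.trim() : "";
  return {
    correlationId: upstream !== "" ? upstream : randomUUID(),
    payload: parsed.payload as T,
  };
}
```

The consumer can then read envelope.correlationId before touching the payload and attach it to its logger and to any downstream calls.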

Operations culture

Instrumentation without search discipline fails in incidents. Document one query pattern per stack (e.g. Datadog, Loki, CloudWatch Logs Insights) so on-call is not blocked on tribal knowledge.
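
One way to document those patterns is to keep them in code next to the runbook. The sketch below assumes logs are structured JSON with a correlation_id field and, for Loki, an env="prod" stream label; both are assumptions, as is the correlationQuery helper itself. Treat the query strings as starting points to verify against your own stack rather than as canonical syntax.

```typescript
export type LogStack = "datadog" | "loki" | "cloudwatch";

// Only allow characters expected in UUIDs and trace IDs, so the ID can be
// embedded in a query string without escaping.
const SAFE_ID = /^[A-Za-z0-9._-]+$/;

// Hypothetical helper: build the "find everything for this request" query per stack.
export function correlationQuery(stack: LogStack, id: string): string {
  if (!SAFE_ID.test(id)) {
    throw new Error(`unexpected characters in correlation ID: ${id}`);
  }
  switch (stack) {
    case "datadog":
      // Datadog log search: attribute lookup (assumed attribute name).
      return `@correlation_id:"${id}"`;
    case "loki":
      // LogQL: stream selector, JSON parser, then a label filter (assumed labels).
      return `{env="prod"} | json | correlation_id="${id}"`;
    case "cloudwatch":
      // CloudWatch Logs Insights: filter on a discovered JSON field.
      return `fields @timestamp, @message | filter correlation_id = "${id}" | sort @timestamp asc`;
  }
}
```

On-call can paste the output straight into the relevant console, which keeps the search pattern in version control instead of in someone's memory.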

Takeaway: Distributed systems debug through correlation, not heroics—standardize IDs, propagate them everywhere, and train the team on how to search.
