Full-Stack Observability with Datadog

Sergio Rojas
5 min read 28 Dec, 2025

Monitoring tells you that something is wrong. Observability tells you why. That distinction sounds like marketing until it’s 2 a.m., checkout is failing for 3% of users, every dashboard is green, and you have no idea where to look. A wall of CPU graphs won’t save you there. Being able to follow one user’s broken request from their browser click all the way down to the slow database query will.

Most teams bolt on a few dashboards and call it observability. Real full-stack observability is different: it means you can ask a brand-new question of your running system and get an answer without shipping new code or switching between five disconnected tools. Datadog is one of the best platforms for getting there, but only if you wire it up with correlation in mind. Let’s go through how.

Monitoring and observability are not the same thing

Monitoring handles the known unknowns. You already know CPU and error rate matter, so you put them on a dashboard and alert when they cross a line. That’s necessary, but it only answers questions you thought to ask in advance.

Observability handles the unknown unknowns, the failure modes nobody predicted. A truly observable system lets you slice and pivot your data on the fly, asking questions you never anticipated. You don’t get there by collecting more dashboards. You get there by connecting your data so it tells a single story.

The three pillars, and why correlation is the actual prize

You’ve heard the three pillars: metrics tell you how much, logs tell you the detailed what, and traces tell you the path a request took. Each one alone is half-blind. A metric spikes but doesn’t say why. A log line is rich but isolated. A trace shows the journey but not the volume.

The entire value of a platform like Datadog is in linking them. The goal isn’t three separate tools that happen to share a login; it’s the ability to see a latency spike, jump straight to the exact traces behind it, and from one trace pivot to the precise log line that explains the failure. Collection is easy; correlation is the prize.

Trace a request across the whole stack

Distributed tracing is the backbone. A single trace ID follows a request as it moves from the browser, through every backend service, down to the database, so you can see exactly where the time went. On a Node.js or NestJS backend, the tracer auto-instruments most popular libraries with almost no code:

// tracer.js: must be loaded before anything else
const tracer = require("dd-trace").init({
  service: "checkout-api",
  env: process.env.DD_ENV,
  version: process.env.DD_VERSION,
  logInjection: true, // automatically tie logs to the active trace
});
module.exports = tracer;

// main.ts: the very first line, before any other import
import "./tracer";

That ordering is not optional. The tracer instruments libraries by patching them as they load, so it has to be initialized before any of them are imported. Getting this wrong is one of the most common reasons people stare at empty trace views.
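To make the idea of one trace ID following a request more concrete, here is a minimal sketch of W3C Trace Context propagation, one of the header formats a tracer can use to hand context from one service to the next. The helper names (`parseTraceparent`, `childTraceparent`) are hypothetical and this is not how dd-trace is implemented; the library handles propagation for you.

```typescript
// Illustrative only: the tracer does this for you. Helper names are hypothetical.
interface TraceContext {
  traceId: string;  // 32 hex chars, shared by every span in the request
  parentId: string; // 16 hex chars, the span that made the outgoing call
  sampled: boolean; // whether the trace was selected for collection
}

// traceparent format: version "00", trace ID, parent span ID, flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid in the Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Header a service sends downstream: same trace ID, its own span as the new parent.
function childTraceparent(ctx: TraceContext, newSpanId: string): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${newSpanId}-${flags}`;
}

// Example header value in the format the specification describes.
const incoming = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
const outgoing = incoming ? childTraceparent(incoming, "b7ad6b7169203331") : null;
```

The trace ID never changes between hops; only the parent span ID does. That shared ID is what lets the backend stitch every service’s spans into one trace.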

Make your logs part of the trace

This is the step that separates a real setup from a noisy one. With log injection enabled, every structured log line from a supported logger automatically carries the active trace_id and span_id. Your logs and traces stop being two separate worlds and become two views of the same request:

import pino from "pino";

export const logger = pino();
// Because logInjection is on, every line emitted during a request
// carries dd.trace_id and dd.span_id, so any log links straight
// back to its distributed trace, and vice versa.

The payoff is enormous. From a slow trace you jump to its exact logs; from a scary log line you jump to the full request path. No more grepping by timestamp and hoping.
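As a rough illustration of what that linkage buys you, the sketch below takes JSON log lines that carry a `dd.trace_id` field and groups them by trace. The sample lines and the `groupLogsByTrace` helper are made up for this example, and the nested `dd` object is an assumption about the injected shape, so check what your logger actually emits.

```typescript
// Assumed shape of an injected log record; verify against your own output.
interface InjectedLog {
  msg: string;
  dd?: { trace_id?: string; span_id?: string };
}

// Group log messages by the trace they were emitted under.
function groupLogsByTrace(lines: string[]): Map<string, string[]> {
  const byTrace = new Map<string, string[]>();
  for (const line of lines) {
    let record: InjectedLog | null;
    try {
      record = JSON.parse(line) as InjectedLog | null;
    } catch {
      continue; // skip anything that is not structured JSON
    }
    const traceId = record?.dd?.trace_id;
    if (!record || !traceId) continue; // emitted outside of any active trace
    const messages = byTrace.get(traceId) ?? [];
    messages.push(record.msg);
    byTrace.set(traceId, messages);
  }
  return byTrace;
}

// Hypothetical sample lines, not real output.
const sampleLines = [
  '{"level":30,"msg":"checkout started","dd":{"trace_id":"111","span_id":"1"}}',
  '{"level":50,"msg":"payment declined","dd":{"trace_id":"111","span_id":"2"}}',
  '{"level":30,"msg":"health check ok"}',
  "not json at all",
];
const grouped = groupLogsByTrace(sampleLines);
```

In Datadog the platform does this join for you; the point of the sketch is only that a shared ID turns a pile of lines into the story of one request.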

Don’t forget the browser

“Full-stack” includes the part your users actually touch. Datadog’s Real User Monitoring (RUM) browser SDK captures real sessions, frontend errors, and Core Web Vitals, and, critically, connects them to your backend traces:

import { datadogRum } from "@datadog/browser-rum";

datadogRum.init({
  applicationId: "<app-id>",
  clientToken: "<client-token>",
  service: "storefront",
  env: "production",
  sessionSampleRate: 100,
  trackUserInteractions: true,
  // The line that makes it full-stack: links browser sessions to backend APM.
  allowedTracingUrls: ["https://api.mystore.com"],
});

With that allowedTracingUrls in place, a slow page in a real user’s session links directly to the backend trace that caused it. That single thread, from rage-click to root cause, is what “full-stack observability” actually means.
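It helps to be precise about which requests that option covers. Datadog’s documentation describes each entry as a string (matched as a URL prefix), a RegExp, or a predicate function. The `shouldTrace` helper below is a hypothetical re-implementation of that rule for illustration, not the SDK’s own code, and the example URLs other than `https://api.mystore.com` are invented.

```typescript
type UrlMatcher = string | RegExp | ((url: string) => boolean);

// Returns true when any entry matches the request URL.
function shouldTrace(url: string, allowed: UrlMatcher[]): boolean {
  return allowed.some((matcher) => {
    if (typeof matcher === "string") return url.startsWith(matcher); // prefix match
    if (matcher instanceof RegExp) return matcher.test(url);
    return matcher(url); // predicate function
  });
}

const allowed: UrlMatcher[] = ["https://api.mystore.com"];

const tracedApiCall = shouldTrace("https://api.mystore.com/v1/cart", allowed);
const tracedThirdParty = shouldTrace("https://cdn.thirdparty.example/x.js", allowed);
// A bare prefix also matches look-alike hosts, because it is only a prefix.
const tracedLookalike = shouldTrace("https://api.mystore.com.evil.example/", allowed);
```

Under prefix matching, the last case matches as well. If trace headers should only ever reach your own API, a trailing slash (`"https://api.mystore.com/"`) or an anchored regular expression is the more conservative choice.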

Turn data into action without drowning in alerts

All this signal is useless if it just becomes noise. The senior discipline is to alert on symptoms users feel, not on every internal twitch. Define Service Level Objectives, something like “99.9% of checkout requests complete under 500 ms”. A 99.9% target leaves an error budget of 0.1% of requests; alert when you’re burning through that budget faster than the SLO period allows, not every time a CPU graph wiggles. An alert that doesn’t map to user pain is an alert your team will eventually learn to ignore.
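To put numbers on “burning through the budget”, here is a small worked sketch of error-budget arithmetic. The function names and sample counts are hypothetical, and the 30-day period is an assumption; Datadog computes SLO burn rates for you once the SLO is defined.

```typescript
interface SloSample {
  total: number; // requests observed in the window
  good: number;  // requests that met the objective (e.g. finished under 500 ms)
}

// Burn rate = observed bad fraction / bad fraction the SLO allows.
// 1 means the budget lasts exactly the SLO period; above 1 it runs out early.
function burnRate(sample: SloSample, target: number): number {
  if (target <= 0 || target >= 1) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (sample.total === 0) return 0;
  const badFraction = (sample.total - sample.good) / sample.total;
  const allowedBadFraction = 1 - target;
  return badFraction / allowedBadFraction;
}

// Days until a full budget is gone if the current burn rate holds.
function daysToExhaustion(rate: number, periodDays: number): number {
  return rate <= 0 ? Infinity : periodDays / rate;
}

// Hypothetical last hour: 30 of 10,000 checkout requests missed the objective.
const lastHour: SloSample = { total: 10_000, good: 9_970 };
const rate = burnRate(lastHour, 0.999);      // roughly 3x the sustainable pace
const daysLeft = daysToExhaustion(rate, 30); // roughly 10 days of a 30-day budget
```

A 0.3% failure rate looks harmless on a dashboard, but against a 99.9% target it uses up a month of budget in about ten days. That is the kind of signal worth paging on.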

Best practices that keep it useful

  • Correlate, don’t just collect: The win is linking metrics, traces, and logs into one story, not stockpiling dashboards nobody opens.

  • Tag consistently everywhere: Use the same env, service, and version tags across all three pillars. Consistent tags are what make cross-pillar pivots actually work.

  • Alert on symptoms, not causes: Page on user-facing SLO breaches. Leave the low-level metrics for investigation, not for waking people up.

  • Mind the bill: Observability data gets expensive fast. Sample high-volume traces intelligently and be deliberate about log retention.
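One way to read “sample intelligently” is to keep every trace that is likely to matter and only a fraction of the routine ones. The sketch below is a hypothetical policy, not Datadog’s sampler: the `keepTrace` helper, the 500 ms threshold, and the 10% rate are all assumptions, and because it looks at errors and duration it only applies once a trace has finished.

```typescript
interface TraceSummary {
  traceId: string;
  hasError: boolean;
  durationMs: number;
}

// FNV-1a 32-bit hash, so the same trace ID always yields the same decision.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash;
}

function keepTrace(t: TraceSummary, sampleRate: number, slowMs: number): boolean {
  // Always keep the traces someone will want during an incident.
  if (t.hasError || t.durationMs >= slowMs) return true;
  // Map the hash to [0, 1) and keep the configured fraction of routine traces.
  return fnv1a(t.traceId) / 0x100000000 < sampleRate;
}

// Hypothetical traces.
const failed: TraceSummary = { traceId: "a1", hasError: true, durationMs: 40 };
const routine: TraceSummary = { traceId: "b2", hasError: false, durationMs: 40 };
const keepFailed = keepTrace(failed, 0.1, 500);   // kept regardless of rate
const keepRoutine = keepTrace(routine, 0.1, 500); // depends on the hash of "b2"
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent: every component that sees the same trace reaches the same verdict, so you do not end up with half a trace.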

Wrapping up

Observability isn’t a dashboard you buy; it’s a property you build into your system on purpose. Instrument your backend tracing first, inject trace context into your logs so the two link automatically, bring the browser into the picture with RUM, and tie it all together with consistent tags and symptom-based SLOs.

Do that, and the next 2 a.m. incident stops being a frantic tool-hopping guessing game and becomes a single, traceable thread you can follow straight to the root cause. That’s the entire reason this work is worth doing.