
Simplify LLM Monitoring with One-Line Integration

January 7, 2026 · 21 min read


If you’ve shipped an LLM-backed feature, you’ve likely felt the pain: behavior changes without warning, latency spikes at the worst possible time, costs drift as prompts evolve, and a small schema error quietly breaks downstream workflows. Traditional observability tools don’t map cleanly to prompt- and model-driven behavior. You need monitoring that answers one practical question: is this prompt behaving the same way it did when it was last safe?

This is exactly where a one-line LLM monitoring integration shines. In this guide, we’ll show how to set up LLM monitoring with minimal code, literally a single context manager wrapped around your LLM call, while still capturing deep telemetry and proactive drift detection. We’ll use the Deadpipe LLM monitoring SDK because it gives you observability in one line, automatic baselines, and schema validation, all through a single pattern:
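from deadpipe import monitor

with monitor(op="order_router", model="gpt-4o-mini"):
    response = client.chat.completions.create(...)  # your existing call, unchanged

We’ll walk through full Python and TypeScript versions of this pattern below.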


Why LLM Monitoring Is Different

LLM-powered systems don’t behave like classical microservices. Key differences:

  • Determinism and drift: Even with the same prompt and temperature, responses can vary. Updates to model weights, prompt templates, or system instructions can cause silent regressions.
  • Cost dynamics: Token usage and cost depend on prompt composition, tool calls, function outputs, and retries. A seemingly minor prompt tweak can double token consumption.
  • Schema fragility: Structured output (JSON, function/tool calls) can break with subtle formatting changes. Without validation, downstream jobs may fail silently.
  • Latency variability: Models can exhibit bursty latency under load, especially with streaming, function calls, or large contexts.
  • Evaluation complexity: Quality isn’t a single metric; it includes correctness, adherence to schema, safety, persona, and business rules.

You need monitoring that’s prompt-aware, model-aware, schema-aware, and cost-aware—without rewriting your entire stack. A one-line integration makes that realistic for every code path.


What “One-Line Integration” Means

The Deadpipe SDK provides a context manager (Python) or wrapper function (JavaScript/TypeScript) that you put around your existing LLM call. Inside that small block, the SDK automatically:

  • Captures prompt, variables, model, temperature, and tool/function call details
  • Records tokens, cost, latency, retries, and vendor-specific response metadata
  • Validates structured outputs against a schema (optional but recommended)
  • Redacts PII and secrets according to your policy
  • Computes drift against an automatic baseline for the same operation and prompt hash
  • Streams telemetry to the monitoring backend with minimal overhead
  • Tags spans with environment, service, version, and experiment identifiers

Critically, you don’t have to instrument every call site. You can start with one operation (e.g., “order_router”) and expand from there, all with the same pattern.


Quickstart: The 60-Second Setup

  1. Install the SDK:

    • Python: pip install deadpipe
    • Node: npm i deadpipe-js
  2. Configure environment:

    • DEADPIPE_API_KEY: your API key
    • DEADPIPE_ENV: e.g., production, staging, dev
    • DEADPIPE_SERVICE: your service name, e.g., backend-api
    • DEADPIPE_RELEASE: the application version or Git SHA (optional but helpful)
  3. Wrap your LLM call:

    • Python (context manager)
    • Node (async callback wrapper)
  4. Deploy. Baselines auto-create on first successful runs. Alerts are quiet until the SDK sees enough data to be confident.


Python Example: One-Line Context Manager Around OpenAI

Suppose you have a function that returns structured JSON for an order routing workflow. Here’s the minimal change:

# before
from openai import OpenAI
client = OpenAI()

def route_order(order_text: str):
    prompt = f"Route the following order to a team. Respond as JSON: {order_text}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[{"role":"system", "content":"You are a precise router. Output JSON."},
                  {"role":"user", "content": prompt}]
    )
    return response.choices[0].message.content

Add one line: the Deadpipe context manager.

# after
from openai import OpenAI
from deadpipe import monitor  # one-line integration
from pydantic import BaseModel

client = OpenAI()

class RouteResult(BaseModel):
    destination: str
    priority: int
    reason: str

def route_order(order_text: str):
    prompt = f"Route the following order to a team. Respond as JSON: {order_text}"

    with monitor(
        op="order_router",
        model="gpt-4o-mini",
        schema=RouteResult,
        tags={"tenant": "public", "feature": "router"},
    ):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.2,
            messages=[{"role":"system", "content":"You are a precise router. Output JSON."},
                      {"role":"user", "content": prompt}]
        )

    # Optionally extract validated JSON directly via helper
    # json_data = monitor.last().json(response)  # alt: returns RouteResult
    return response.choices[0].message.content

What happened here?

  • The with monitor(...) context captures everything inside. You don’t change your OpenAI call.
  • Deadpipe inspects requests/responses and records cost, tokens, latency, and content hashes (safely redacted).
  • If you specify schema=RouteResult, the SDK runs best-effort JSON extraction and validation. Failures are recorded and can alert you before downstream code breaks.
  • op is a human-friendly operation name that ties together baseline, dashboards, and alerts.

Tip: For structured output, use the helper to parse/validate JSON and raise/record a failure if the model deviates.

with monitor(op="order_router", model="gpt-4o-mini", schema=RouteResult) as m:
    response = client.chat.completions.create(...)
    result = m.json(response)  # returns RouteResult, raises on schema mismatch

TypeScript Example: One-Line Wrapper Around OpenAI

// before
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export async function summarize(text: string) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You write tight bullet summaries." },
      { role: "user", content: `Summarize:\n${text}` },
    ],
  });
  return res.choices[0].message.content ?? "";
}

Add one wrapper:

// after
import OpenAI from "openai";
import { monitor } from "deadpipe-js"; // one-line integration
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export async function summarize(text: string) {
  return await monitor(
    {
      op: "bullet_summary",
      model: "gpt-4o-mini",
      tags: { feature: "summary" },
    },
    async () => {
      const res = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "You write tight bullet summaries." },
          { role: "user", content: `Summarize:\n${text}` },
        ],
      });
      return res.choices[0].message.content ?? "";
    }
  );
}

Notes:

  • monitor takes config and an async function. It times and records everything within.
  • If you specify schema (using zod or JSON schema), Deadpipe validates the return payload.
  • You can return any value; Deadpipe captures the LLM parts and correlates them to the outer function call.

Works With Multiple Providers and Libraries

The one-line wrapper captures calls to major providers and frameworks:

  • Providers: OpenAI, Anthropic, Google (Vertex/GenAI), Azure OpenAI, Cohere, Mistral, together.ai, Ollama, and self-hosted endpoints with OpenAI-compatible APIs
  • Frameworks: LangChain, LlamaIndex, Semantic Kernel, Guidance, LiteLLM, Vercel AI SDK
  • Transport types: non-streaming, server-sent events (SSE), and WebSocket streaming
  • Tooling: function/tool calls, structured output APIs, and tool return traces

In most cases, you don’t need to change your code beyond adding the context manager. The SDK inspects the provider client calls made within the monitored block.
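For example, the same pattern works unchanged around an Anthropic call. A sketch; the op name "shorten" is illustrative, and it assumes the anthropic Python SDK is installed with ANTHROPIC_API_KEY set:

import anthropic
from deadpipe import monitor

claude = anthropic.Anthropic()

def shorten(text: str) -> str:
    with monitor(op="shorten", model="claude-3-5-sonnet"):
        msg = claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Shorten this:\n{text}"}],
        )
    return msg.content[0].text  # Anthropic returns a list of content blocks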


Streaming Responses Without Losing Telemetry

Streaming is a common source of lost metrics. Deadpipe maintains token-by-token traces:

Python (OpenAI streaming):

from deadpipe import monitor
from openai import OpenAI

client = OpenAI()

def stream_chat(messages):
    with monitor(op="chat_stream", model="gpt-4o-realtime", tags={"stream": "yes"}):
        stream = client.chat.completions.create(
            model="gpt-4o-realtime-preview",
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            # Deadpipe captures deltas and latency per chunk
            yield chunk.choices[0].delta.content or ""

Node (Anthropic streaming):

import Anthropic from "@anthropic-ai/sdk";
import { monitor } from "deadpipe-js";

const anthropic = new Anthropic();

export async function streamMessages(messages: Anthropic.Messages.MessageParam[]) {
  return await monitor({ op: "anthropic_stream", model: "claude-3-5-sonnet" }, async () => {
    const stream = await anthropic.messages.stream({
      model: "claude-3-5-sonnet-20241022",
      messages,
      max_tokens: 1024,
    });
    for await (const event of stream) {
      // Deadpipe accounts for streaming tokens, latency, and backpressure
      if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
        process.stdout.write(event.delta.text);
      }
    }
  });
}

Deadpipe correlates partial tokens to a single span, records first-byte latency, total duration, and per-chunk throughput. If streaming fails mid-way, you still get partial telemetry.


Structured Output and Schema Validation

Schema validation is the biggest quality step you can take with almost no code. With one-line monitoring, you can specify a schema, and the SDK will:

  • Attempt to extract JSON (using robust parsers tolerant to extra text)
  • Validate against a Pydantic model (Python) or zod/JSON schema (TypeScript)
  • Report violations as soft or hard failures depending on your policy
  • Record validation fields for analytics (e.g., which field fails most often)

Example (Python, Pydantic):

from pydantic import BaseModel, Field
from deadpipe import monitor

class Product(BaseModel):
    sku: str
    name: str
    price: float = Field(ge=0)
    tags: list[str]

def generate_product(desc: str) -> Product:
    with monitor(op="product_json", model="gpt-4o", schema=Product, schema_mode="soft") as m:
        resp = client.chat.completions.create(...)
        return m.json(resp, fallback={"sku":"", "name":"", "price":0, "tags":[]})

Example (TypeScript, zod):

import { z } from "zod";
import { monitor } from "deadpipe-js";

const Product = z.object({
  sku: z.string(),
  name: z.string(),
  price: z.number().min(0),
  tags: z.array(z.string()).default([]),
});

async function createProduct(description: string) {
  return await monitor({ op: "product_json", model: "gpt-4o", schema: Product }, async (m) => {
    const res = await openai.chat.completions.create(...);
    return m.json(res, { coerceNumbers: true });
  });
}

Common policies:

  • Hard: Throw on validation failure; increments error rate; blocks downstream.
  • Soft: Record failure; return partial/fallback; emit warning; keeps pipeline running.
  • Retry on fail: Automatically re-ask the model with a corrective system prompt.
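For critical paths, the hard policy turns a malformed response into an ordinary exception the caller can handle. A minimal sketch reusing the Product model above (the exact exception type raised by m.json is SDK-defined, so a broad except is shown):

def extract_product_strict(desc: str) -> Product | None:
    try:
        with monitor(op="product_json", model="gpt-4o", schema=Product, schema_mode="hard") as m:
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": f"Return product JSON for: {desc}"}],
            )
            return m.json(resp)  # raises on schema mismatch in hard mode
    except Exception:
        # Deadpipe has already recorded the validation failure; degrade gracefully here
        return None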

Automatic Baselines and Drift Detection

Deadpipe automatically computes baselines for each op + model + prompt hash + environment combination. Baselines include distributions for:

  • Latency: p50, p90, and tail
  • Cost: input_tokens, output_tokens, total cost per call
  • Output shape: JSON schema pass rate, field completeness, average size
  • Behavior: cosine similarity to baseline embedding (optional), classification distribution, tool-call frequencies
  • Errors: vendor/network errors, validation errors, rate-limit and retry counts

Drift detection:

  • Change-point detection on latency and cost using EWMAs and CUSUM
  • Distribution drift with Jensen-Shannon divergence on categorical outputs and binned numeric metrics
  • Semantic drift via embeddings of outputs or key fields
  • Regression from prior “safe” release via release-aware baselines

You’ll receive alerts only when there’s enough evidence (a configurable minimum sample size), which reduces noise during low-traffic periods. The sketch below illustrates the change-point idea on latency.
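This toy EWMA detector is purely illustrative of the statistics involved, not Deadpipe’s actual implementation (which combines EWMA, CUSUM, and minimum-sample gating):

def ewma_change_point(latencies_ms, alpha=0.1, threshold=3.0, warmup=5):
    """Return the index where latency departs from its smoothed baseline, or None."""
    ewma = latencies_ms[0]
    ewm_var = 0.0
    for i, x in enumerate(latencies_ms[1:], start=1):
        diff = x - ewma                      # deviation from the current baseline
        std = ewm_var ** 0.5
        if i >= warmup and std > 0 and abs(diff) > threshold * std:
            return i                         # candidate change point
        ewma += alpha * diff                 # fold the point into the baseline
        ewm_var = (1 - alpha) * (ewm_var + alpha * diff * diff)
    return None

print(ewma_change_point([400, 410, 395, 405, 398, 402, 900]))  # -> 6 (the spike)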


Tagging Experiments, Prompts, and Tenants

You can add tags without changing your request payload:

  • tags={"experiment": "fewshot-v2", "tenant": tenant_id, "region": "us-east-1"}
  • versioning: DEADPIPE_RELEASE set to your Git SHA ties runs to code releases
  • prompt hashing: Deadpipe computes a stable hash of your prompt template (with variable placeholders), so you can track drift caused by template edits vs. data

These tags power slice-and-dice views on dashboards: compare tenants, models, experiments, or releases to spot regressions quickly.


Cost and Latency Tracking You Can Trust

LLM bills can surprise you. Deadpipe includes:

  • Vendor-accurate token counting where providers supply usage; fallback tokenizers where they don’t
  • Cost calculators that account for input/output tokens, tools, and vendor pricing tiers
  • Per-op, per-tenant, and per-model cost breakdowns
  • First-byte and total latency, streaming throughput, and retry overhead
  • Saturation indicators: are you bumping into rate limits, context limits, or tool latency?

Alerts you might configure:

  • Cost per call exceeds baseline by X%
  • Token usage jumps by Y% for operation=“search_rerank”
  • p95 latency > target SLA for 5 minutes
  • Schema failure rate > 1% in production

Privacy, Security, and Redaction

Monitoring does not have to mean leaking data. Deadpipe supports:

  • Built-in redaction rules: emails, phone numbers, credit cards, SSNs, access tokens
  • Custom regex redaction: e.g., order IDs or internal tokens
  • Field-level redaction: e.g., redact user_content but keep metadata
  • Enterprise: self-hosting options, VPC peering, KMS encryption, and PII classification
  • Data TTLs: configure retention period per environment (short in prod, longer in staging)
  • Opt-out fields: mark a variable as do_not_log to skip entirely

Set via environment or code:

export DEADPIPE_REDACT="email,phone,credit_card,access_token"
export DEADPIPE_PII_STRICT=true

Or programmatically:

with monitor(op="support_chat", redact={"user_email": True, "phone": True}):
    ...

Offline Mode, Retries, and Network Failure Handling

A common concern: what if the monitoring backend is unreachable?

Deadpipe includes:

  • Async batching: minimal overhead in hot paths
  • Local fallback queue: writes to disk or memory if network is down
  • Retry with backoff: avoids impacting your API latency
  • Sampling controls: rate-limit spans at the SDK level if needed
  • Circuit breaker: if telemetry backpressure grows, the SDK disables itself gracefully and logs a warning

Env toggles:

export DEADPIPE_DISABLE=false       # enable/disable globally
export DEADPIPE_SAMPLE=0.5          # sample 50% of calls
export DEADPIPE_QUEUE_DIR=/var/tmp/deadpipe

Integrations: LangChain, LlamaIndex, Semantic Kernel

You can continue to use higher-level frameworks.

LangChain:

from deadpipe import monitor
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")

def run_chain(input: str):
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Be terse"),
        ("user", "{input}")
    ])
    chain = prompt | llm
    with monitor(op="langchain_example", model="gpt-4o-mini"):
        return chain.invoke({"input": input})

LlamaIndex:

from deadpipe import monitor
from llama_index.core import VectorStoreIndex, Document
from llama_index.llms.openai import OpenAI

def answer(question: str, docs: list[str]):
    index = VectorStoreIndex.from_documents([Document(text=d) for d in docs])
    with monitor(op="llamaindex_qa", model="gpt-4o-mini"):
        query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4o-mini"))
        return query_engine.query(question)

Semantic Kernel (C# pseudo-code):

using Deadpipe;
using Microsoft.SemanticKernel;

var kernel = Kernel.CreateBuilder().Build();

using (DeadpipeMonitor.With(op: "sk_router", model: "gpt-4o-mini"))
{
    var result = await kernel.InvokePromptAsync("Route: {{$input}}", new() { ["input"] = text });
    return result;
}

Tool/Function Calls and Multi-Step Traces

Tool calls add complexity. Deadpipe tracks tool/function invocation counts, payload size, and timing.

Python with function-calling:

with monitor(op="flight_booking", model="gpt-4o", schema=Itinerary) as m:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=[...],  # function schemas
        tool_choice="auto",
    )
    # If the model emitted a function call, Deadpipe records it.
    # Then you can call the tool and send result back.
    if tool_calls := resp.choices[0].message.tool_calls:
        tool_messages = [call_tool(tc) for tc in tool_calls]  # your code: run each tool, return {"role": "tool", ...} messages
        followup = client.chat.completions.create(
            model="gpt-4o",
            messages=messages + [resp.choices[0].message] + tool_messages,
        )
        final = m.json(followup)

Node with tool calls (Anthropic):

await monitor({ op: "support_tools", model: "claude-3-opus" }, async () => {
  const res = await anthropic.messages.create({
    model: "claude-3-opus-20240229",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Reset my password" }],
    tools: [{ name: "resetPassword", input_schema: { type: "object", properties: { email: { type: "string" } } } }],
  });
  // Deadpipe records tool suggestions and subsequent tool result exchange
});

The span shows each step with timing and token cost, so you can see where latency accumulates.


Multimodal: Images, Audio, and Vision

Deadpipe records multimodal metadata without storing sensitive binaries:

  • Stores MIME types, counts, and sizes, not raw bytes
  • Optional hashing of images/audio for deduplication without content storage
  • Captures model modality (vision, audio transcribe, audio generate) and cost

Example:

with monitor(op="vision_caption", model="gpt-4o-mini-vision", tags={"modality":"vision"}):
    response = openai.chat.completions.create(
      model="gpt-4o-mini-vision",
      messages=[{"role":"user", "content":[{"type":"input_text","text":"Describe this image"},
                                           {"type":"input_image","image_url": url}]}]
    )

Advanced Configuration in One Place

The one-line wrapper takes optional parameters for fine control:

  • op: required operation name
  • model: model name (helps disambiguate vendor telemetry)
  • schema: pydantic/zod/JSON schema for validation
  • schema_mode: "hard" or "soft"
  • tags: dictionary of extra labels
  • redact: fields to redact
  • timeout: override default timeouts for telemetry shipping
  • sample: per-call sampling override
  • capture: options like capture_prompt_vars=False to reduce cardinality

Example:

with monitor(
    op="invoice_extractor",
    model="gpt-4o",
    schema=Invoice,
    schema_mode="hard",
    tags={"region":"eu", "experiment":"extractor-v3"},
    redact={"customer_email": True, "address": True},
    sample=1.0,
):
    ...

Guardrails in CI/CD: Catch Regressions Before They Ship

Integrate Deadpipe in your test suite:

  • Prompt snapshot tests: run a fixed set of inputs against your prompt and compare schema pass rate and cost to baseline
  • Release gating: block deployment if drift exceeds a threshold
  • Canary: enable monitoring on a small percentage of traffic and promote when healthy

Python example (pytest):

def test_router_snapshot(deadpipe_test):
    with monitor(op="order_router", model="gpt-4o-mini", schema=RouteResult):
        out = route_order("10 red widgets to warehouse A")
    # Assert schema passed and latency/cost within 20% of baseline
    metrics = deadpipe_test.last_metrics()
    assert metrics.schema_pass_rate >= 1.0
    assert metrics.cost_change <= 0.2
    assert metrics.p95_latency_change <= 0.2

GitHub Actions gate:

- name: Deadpipe Drift Check
  run: deadpipe check --op order_router --max-cost-change 0.3 --max-latency-change 0.3 --min-pass-rate 0.99

Dashboards and Alerting

After you deploy, you’ll see:

  • Overview by op: calls, cost, latency, error rate, schema pass
  • Model comparison: latency and cost head-to-head for the same prompt
  • Drift explorer: change-point timeline with contributing factors
  • Prompt baseline: template hash, example inputs, typical outputs
  • Tenant slices: per-customer view to support SLOs

Alerts go to Slack, PagerDuty, email, or webhooks. They include:

  • What changed (metric and magnitude)
  • When it started
  • Suggested root causes (e.g., new release, prompt change, model change)
  • Links to example failing traces

Real-World Patterns: Use Cases You Can Copy

  1. Structured routing

    • op: order_router
    • schema: destination, priority, reason
    • alerts: schema_pass_rate < 0.995, cost_change > 25%
  2. Extractive QA

    • op: invoice_extractor
    • schema: vendor, date, total, currency, line_items
    • baseline drift: when a new vendor format appears, pass rate dips—alerts catch it
  3. RAG search

    • op: search_rerank
    • metrics: average reranked@3 score vs. offline evaluator
    • drift signal: drop in semantic similarity between query and top doc
  4. Safety moderation

    • op: moderation
    • distribution watch: allowed vs. blocked ratio; unexpected spike triggers investigation
  5. Translation

    • op: translation
    • language distribution: auto-detect mismatches, e.g., model answering in source language instead of target
  6. Code generation

    • op: code_synth
    • schema: compilable flag, language, lints
    • integration with unit tests to verify compile success rate trend

Performance Overhead and How to Measure It

Deadpipe is designed for <5 ms overhead per call under normal conditions via:

  • Async, batched emission
  • Zero-copy capture of request/response
  • Content hashing and redaction in-stream
  • Local queue fallback

Measure it yourself:

import time
from deadpipe import monitor

def bench():
    t0 = time.perf_counter()
    with monitor(op="bench", model="gpt-4o-mini"):
        _ = client.chat.completions.create(...)
    return (time.perf_counter() - t0) * 1000

# Run 100 iterations, compare with deadpipe disabled
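A rough way to turn that last comment into numbers, as a sketch (it assumes bench() is filled in with a real call and that DEADPIPE_DISABLE toggles the SDK as described earlier):

import statistics

def median_latency_ms(n: int = 100) -> float:
    bench()  # warm-up call
    return statistics.median(bench() for _ in range(n))

# Run once with DEADPIPE_DISABLE=true and once with it unset; the difference
# between the two medians approximates per-call monitoring overhead.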

If you need even lower overhead:

  • Lower the sample rate: DEADPIPE_SAMPLE=0.2
  • Disable embeddings for semantic drift
  • Reduce tags with high cardinality

Common Pitfalls and How to Avoid Them

  • Double-wrapping: Don’t nest monitor contexts around the same call; you’ll see duplicate spans. Solution: one context per logical operation.
  • Missing environment metadata: Without DEADPIPE_ENV and DEADPIPE_SERVICE, your data will be hard to filter. Set them in all environments.
  • High-cardinality tags: Putting user_id directly as a tag explodes cardinality. Prefer tenant_id or a stable hash prefix.
  • JSON parsing failures: If the model includes prose around JSON, use the schema helpers (m.json) which robustly extract blocks.
  • Streaming not captured: Make sure the stream iteration is inside the monitor block.
  • Retries hide cost: If your client auto-retries, Deadpipe only records those retries when they happen inside the monitor block. Turn on retry logging in your client for correlation.
  • Rate limits: Frequent 429s inflate latency. Configure backoff and consider provider-specific rate plans.
  • Async contexts: In Python, ensure the context spans the awaited call. In Node, pass an async function to monitor.
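For the last pitfall, a minimal async sketch. It assumes the synchronous monitor context also works around awaited calls (as the pitfall above implies) and uses OpenAI’s AsyncOpenAI client:

from openai import AsyncOpenAI
from deadpipe import monitor

async_client = AsyncOpenAI()

async def route_order_async(order_text: str) -> str:
    with monitor(op="order_router", model="gpt-4o-mini"):
        response = await async_client.chat.completions.create(  # awaited call stays inside the block
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": order_text}],
        )
    return response.choices[0].message.content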

Example: Migrating a Legacy Prompt With Minimal Code Changes

Legacy function:

def classify_ticket(text: str) -> str:
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=[...])
    return resp.choices[0].message.content.strip()

Monitored version with drift control:

from deadpipe import monitor

def classify_ticket(text: str) -> str:
    with monitor(op="ticket_classifier", model="gpt-4o-mini", tags={"legacy":"true"}) as m:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=[...], temperature=0)
        out = resp.choices[0].message.content.strip()
        m.assert_in(out, {"billing", "technical", "sales"}, mode="soft")
        return out
  • m.assert_in checks output in an allowed set and records a violation if not.
  • Over time, you’ll see drift if the distribution changes (e.g., more “billing” tickets due to a product launch). That’s expected but measurable.

Observability for On-Call: Practical Runbook

When latency spikes:

  1. Check op-level p95 latency. Is it all traffic or a single tenant/model?
  2. Look at retry rate and 429s; switch to a backup model if needed.
  3. Inspect tool calls. Is a downstream API slow?
  4. Compare release baseline to prior: did you change prompt or system message?
  5. Temporarily increase sampling to 1.0 for deeper traces.

When schema failures rise:

  1. Review example failing outputs on the dashboard.
  2. Enable automatic re-ask with a stricter system message.
  3. Add guard phrases or switch to a structured output API (e.g., response_format in OpenAI; see the sketch after this list).
  4. Update schema or fallbacks if a new field becomes optional.
  5. Roll back the prompt template if issue correlates with a recent change.
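For step 3, OpenAI’s JSON response_format pairs naturally with the schema check; a sketch using the router example from earlier:

with monitor(op="order_router", model="gpt-4o-mini", schema=RouteResult) as m:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # ask for valid JSON at the API level
        messages=[
            {"role": "system", "content": "You are a precise router. Output JSON."},
            {"role": "user", "content": prompt},
        ],
    )
    result = m.json(resp)  # still validates the field-level contract against RouteResult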

When cost jumps:

  1. Compare input token histogram—did context length increase?
  2. Verify temperature and top_p settings; higher randomness may increase output length.
  3. Check model switch—were you routed to a larger model?
  4. Investigate retries or tool loops that expand conversations.

Prompt Hygiene: Instrumentation Tips

  • Name your ops clearly: “checkout_assistant”, “invoice_extractor”.
  • Use structured prompt templates with placeholders; let Deadpipe hash templates to distinguish versions.
  • Keep system messages stable; change them intentionally and label with tags like experiment=fact_mode.
  • For structured output, include a JSON-only instruction and validate with schema_mode="hard" for critical paths.
  • Track model and provider versions; changes can cause drift without code changes.

Multi-Tenancy and Compliance

For SaaS products:

  • Tag spans with tenant_id and region to respect data boundaries.
  • Configure redaction stricter in production than staging.
  • Use per-tenant budgets: alert when cost exceeds thresholds.
  • Export per-tenant reports weekly from Deadpipe’s cost API.

Compliance tips:

  • Store only metadata with secure hashing for sensitive fields.
  • Use self-hosted or regional endpoints for data residency needs.
  • Document your redaction and retention policies; auditors will ask.

Extending Monitoring Beyond LLM Calls

A full picture includes:

  • Input pre-processing: tokenization, retrieval latency, cache hits
  • Output post-processing: parsers, validators, business rules
  • Downstream effects: job queue sizes, database write rates, user-facing error metrics
  • Human-in-the-loop: annotation time, acceptance/rejection rate

Deadpipe can link custom spans:

from deadpipe import span

with span("retrieval") as s:
    s.set("query", q)
    docs = retriever.search(q)
    s.set("doc_count", len(docs))

with monitor(op="qa", model="gpt-4o"):
    answer = llm(docs, q)

This shows where time is spent and how retrieval quality correlates with LLM performance.


Frequently Asked Questions

  • Do I need to send prompts in plaintext? No. You can redact or mask user content while keeping template structure and variable types. Deadpipe stores content hashes for drift without raw text if you prefer.
  • What about vendor-native analytics? They’re useful but siloed per provider and lack schema validation, cross-model baselines, and app-specific tagging.
  • Will this break my hot path? Overhead is typically a few milliseconds, amortized through async batching. You can sample or disable per-endpoint during incidents.
  • Can I monitor non-LLM AI calls? Yes. Wrap embeddings, rerankers, and classifiers with the same pattern (see the sketch after this list).
  • How do I roll out safely? Start with a low-traffic op or staging, verify baselines, then enable in production. Use alerts with conservative thresholds initially.
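As an example of the non-LLM case above, the same wrapper around an embeddings call (the op name "embed_docs" is illustrative):

with monitor(op="embed_docs", model="text-embedding-3-small"):
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=["10 red widgets to warehouse A"],
    )
vector = emb.data[0].embedding  # cost and latency are recorded like any LLM call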

End-to-End Example: Email Triage Assistant

Goal: classify incoming emails, summarize, and suggest next actions with tool calls.

Python sketch:

from pydantic import BaseModel
from deadpipe import monitor

class Triage(BaseModel):
    category: str  # billing|tech|sales|spam
    summary: str
    action: str  # reply|escalate|ignore

def triage_email(email_text: str) -> Triage:
    with monitor(
        op="email_triage",
        model="gpt-4o-mini",
        schema=Triage,
        tags={"channel": "email"},
    ) as m:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": "Classify, summarize, and propose an action. Respond as JSON."},
                {"role": "user", "content": email_text},
            ],
        )
        triage = m.json(resp)
        m.assert_in(triage.category, {"billing","tech","sales","spam"}, mode="soft")
        return triage
  • Baseline stabilizes after ~200 calls; drift alerts if category distribution shifts abruptly.
  • Monitor cost and latency; if spikes occur, consider switching to a smaller model or adding caching.
  • Add a post-action tool call (e.g., create_ticket) and let Deadpipe capture it as part of the span.

Minimal Maintenance: Keep It to One Line

As your app grows, resist the temptation to sprinkle custom logging everywhere. The one-line wrapper scales:

  • Same pattern for every op
  • Shared config via environment
  • Baselines and alerts adapt as you add new models or prompts
  • Schema validation evolves with your contracts

When you need exceptions (e.g., a performance-critical endpoint), use sampling or disable selectively:

with monitor(op="hot_path", model="gpt-4o", sample=0.1):
    ...

Putting It All Together

  • Start with one operation. Wrap it with the one-line context manager or wrapper.
  • Add a schema for any structured output. Choose soft vs. hard mode.
  • Tag your operation with environment, service, and experiment labels.
  • Deploy and let baselines form. Set alert thresholds once the curve stabilizes.
  • Expand to other operations, models, and modalities.
  • Use drift alerts and dashboards to drive prompt and model changes safely.

Monitoring should not be an afterthought, and it shouldn’t require a rewrite. With a one-line integration, you get deep visibility, automatic drift detection, schema safety, and cost control—exactly the guardrails you need to move fast with confidence.

If you’ve been delaying LLM monitoring because of complexity, try the single context manager approach. Wrap an operation today, ship, and sleep better tonight.

Enjoyed this article?

Share it with your team or try Deadpipe free.