LLM Tracing with OpenTelemetry

OTel GenAI semantic conventions, manual and auto-instrumentation for Anthropic/LangChain, Langfuse native SDK patterns, cost tracking per trace, and Prometheus alerting thresholds.

Distributed tracing for LLM systems: every LLM call, retrieval, tool execution, and agent step should be a span. Without tracing you're flying blind; you can't debug latency, cost overruns, or quality regressions.


Why LLM Systems Need Tracing

A RAG pipeline has 5-10 steps: embed query → search vector DB → rerank → build prompt → LLM call → parse output. When something goes wrong (wrong answer, high latency, high cost), you need to know which step failed. Tracing makes that visible.

Key signals to capture:

  • Latency: which step is slow? Is it the retrieval, the LLM, or parsing?
  • Token counts: per-call input/output tokens for cost attribution
  • Model and version: which model answered? Was it the right one?
  • Retrieval quality: what was retrieved? Was it relevant?
  • Errors: did any step fail? What was the error?

OpenTelemetry Semantic Conventions for LLMs

OTel added LLM-specific semantic conventions (GenAI conventions) in 2024. Key span attributes:

gen_ai.system              = "anthropic" | "openai" | "cohere"
gen_ai.request.model       = "claude-sonnet-4-6"
gen_ai.request.max_tokens  = 1024
gen_ai.request.temperature = 0.7
gen_ai.response.model      = "claude-sonnet-4-6"  # actual model used
gen_ai.usage.input_tokens  = 523
gen_ai.usage.output_tokens = 187
gen_ai.operation.name      = "chat" | "text_completion" | "embeddings"

Manual OTel Instrumentation

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import anthropic

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my_llm_app")

# Instrument a RAG pipeline (vector_store, reranker, and build_prompt are
# application-level helpers assumed to be defined elsewhere)
def rag_query(question: str) -> str:
    with tracer.start_as_current_span("rag_query") as root_span:
        root_span.set_attribute("question", question[:200])
        
        # Retrieval span
        with tracer.start_as_current_span("retrieve") as span:
            docs = vector_store.search(question, k=5)
            span.set_attribute("docs_retrieved", len(docs))
            span.set_attribute("query", question)
        
        # Reranking span
        with tracer.start_as_current_span("rerank") as span:
            docs = reranker.rerank(question, docs, top_n=3)
            span.set_attribute("docs_after_rerank", len(docs))
        
        # LLM call span
        with tracer.start_as_current_span("llm_call") as span:
            client = anthropic.Anthropic()
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=512,
                messages=[{"role": "user", "content": build_prompt(question, docs)}],
            )
            span.set_attribute("gen_ai.system", "anthropic")
            span.set_attribute("gen_ai.request.model", "claude-sonnet-4-6")
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
            answer = response.content[0].text
        
        root_span.set_attribute("answer_length", len(answer))
        return answer
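
Errors are one of the key signals listed earlier, but they need explicit handling: record the exception on the span and mark the span failed so the backend can filter on error status. A minimal sketch wrapping the LLM call from the example above:

from opentelemetry.trace import StatusCode

with tracer.start_as_current_span("llm_call") as span:
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": "ping"}],
        )
    except Exception as exc:
        span.record_exception(exc)         # attaches the stack trace as a span event
        span.set_status(StatusCode.ERROR)  # marks the span failed for filtering
        raise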

Auto-Instrumentation

Libraries handle the instrumentation automatically. Install the instrumentors first (these are the OpenLLMetry packages, which also cover OpenAI, LangChain, and LlamaIndex):

pip install opentelemetry-instrumentation-anthropic
pip install opentelemetry-instrumentation-openai
pip install opentelemetry-instrumentation-langchain

from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor

AnthropicInstrumentor().instrument()

# Now every anthropic.Anthropic().messages.create() call is automatically traced
# with all gen_ai.* attributes populated
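
Auto-instrumentation only creates the spans; it still needs a configured TracerProvider and exporter. A minimal end-to-end sketch reusing the setup from the manual section:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
import anthropic

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

AnthropicInstrumentor().instrument()  # patch the SDK before constructing clients

client = anthropic.Anthropic()
response = client.messages.create(    # traced automatically, no manual span code
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "ping"}],
)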

Langfuse as OTel Backend

Langfuse accepts OTel traces natively: it exposes an OTLP HTTP endpoint, so you point a standard OTLP exporter at it with Basic auth built from the public/secret key pair, and spans show up as traces in its UI (endpoint path per the Langfuse OTel docs):

import base64

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

LANGFUSE_AUTH = base64.b64encode(b"pk-lf-...:sk-lf-...").decode()

# provider is the TracerProvider from the manual setup above
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(
        endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
        headers={"Authorization": f"Basic {LANGFUSE_AUTH}"},
    ))
)

# All OTel spans now appear in Langfuse

Or use Langfuse's native SDK (more features):

from langfuse.decorators import observe, langfuse_context

@observe()  # auto-creates a trace for the top-level call, nested spans for inner calls
def retrieve(question: str) -> list:
    docs = vector_store.search(question)
    langfuse_context.update_current_observation(metadata={"num_docs": len(docs)})
    return docs

@observe(as_type="generation")  # records model and token usage as a generation
def llm_call(question: str, docs: list) -> str:
    response = call_llm(question, docs)
    langfuse_context.update_current_observation(
        model="claude-sonnet-4-6",
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
    )
    return response.content[0].text

@observe()
def rag_pipeline(question: str) -> str:
    docs = retrieve(question)
    return llm_call(question, docs)
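
One operational detail: the SDK sends events from a background queue, so short-lived scripts should flush before the process exits or queued observations are lost:

langfuse_context.flush()  # blocks until queued events are delivered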

LangSmith Integration

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my_rag_project"

# All LangChain calls now traced to LangSmith automatically
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
# Every call to llm.invoke() is traced
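
LangSmith can also trace plain Python outside LangChain via its @traceable decorator. A minimal sketch, reusing the assumed vector_store helper from earlier:

from langsmith import traceable

@traceable(run_type="retriever")  # appears as a retriever run in LangSmith
def fetch_docs(query: str) -> list:
    return vector_store.search(query, k=5)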

Cost Tracking

Track cost per trace to find expensive paths:

PRICING = {
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},  # per million tokens
    "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    "claude-opus-4-7": {"input": 5.0, "output": 25.0},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# In your span
span.set_attribute("cost_usd", calculate_cost(model, in_tokens, out_tokens))

Alerting on Quality Signals

Set up alerts for:

  • P99 latency > 10s (SLA breach)
  • Error rate > 1% (model/API issues)
  • Average tokens/call > threshold (prompt bloat)
  • Cached token ratio < expected (prompt cache regression)

# Prometheus metrics alongside OTel traces
from prometheus_client import Counter, Histogram

llm_latency = Histogram("llm_call_seconds", "LLM call latency", ["model", "operation"])
llm_errors = Counter("llm_errors_total", "LLM errors", ["model", "error_type"])
llm_tokens = Counter("llm_tokens_total", "LLM tokens used", ["model", "type"])

# Instrument
with llm_latency.labels(model="claude-sonnet-4-6", operation="chat").time():
    response = client.messages.create(...)
    llm_tokens.labels(model="claude-sonnet-4-6", type="input").inc(response.usage.input_tokens)
    llm_tokens.labels(model="claude-sonnet-4-6", type="output").inc(response.usage.output_tokens)

Key Facts

  • OTel GenAI conventions (2024): gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens are the standard span attributes
  • opentelemetry-instrumentation-anthropic: auto-instruments all Anthropic SDK calls with gen_ai.* attributes
  • Langfuse @observe() decorator: auto-creates a trace per function call with zero-boilerplate span creation
  • LangSmith auto-tracing: set LANGCHAIN_TRACING_V2=true — every LangChain/LangGraph call is captured
  • Alerting thresholds: P99 latency >10s, error rate >1%, cached token ratio below expected
  • Cost tracking: Sonnet 4.6 $3/$15 per M, Haiku 4.5 $1/$5, Opus 4.7 $5/$25

Common Failure Cases

AnthropicInstrumentor().instrument() called after the anthropic.Anthropic() client is constructed, so no traces are captured
Why: OpenTelemetry instrumentation patches the SDK at import/construction time; calling instrument() after the client is already instantiated does not patch existing instances.
Detect: no spans appear in the tracing backend despite instrument() being called; adding a log line before the first LLM call shows the instrumentor registered, but spans are missing.
Fix: call AnthropicInstrumentor().instrument() before creating any anthropic.Anthropic() instances; place it at the top of your application entry point, before other imports that trigger client construction.

BatchSpanProcessor silently drops spans during high-throughput bursts
Why: BatchSpanProcessor has a fixed queue size (default 2048 spans); when the queue fills faster than the exporter can flush, spans are dropped without errors.
Detect: span count in the tracing backend is consistently lower than expected during load tests; no errors in application logs.
Fix: increase max_queue_size and max_export_batch_size in BatchSpanProcessor; or switch to SimpleSpanProcessor for lower-throughput applications where latency from synchronous export is acceptable.
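
Queue and batch sizes are constructor arguments on BatchSpanProcessor. A sketch with illustrative (not recommended) values:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317"),
    max_queue_size=8192,          # default 2048; spans beyond this are dropped
    max_export_batch_size=1024,   # default 512
    schedule_delay_millis=2000,   # default 5000; export more frequently under load
)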

Langfuse @observe() decorator creates a new root trace for every nested function call instead of nesting spans
Why: @observe() creates a root trace when no parent context exists in the current thread; if the decorated function is called from a thread pool executor or background task, the parent trace context is not propagated.
Detect: Langfuse shows dozens of single-span traces instead of one nested trace per user request; the hierarchy is flat.
Fix: propagate the OTel context explicitly to background tasks using contextvars.copy_context(); or use langfuse_context.update_current_trace() to attach orphan spans to the correct parent trace.
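
A sketch of the contextvars fix for thread-pool submission (submit_with_context is a hypothetical helper):

import contextvars
from concurrent.futures import ThreadPoolExecutor

def submit_with_context(pool: ThreadPoolExecutor, fn, *args):
    ctx = contextvars.copy_context()  # snapshot carries the active trace context
    return pool.submit(ctx.run, fn, *args)

with ThreadPoolExecutor(max_workers=4) as pool:
    future = submit_with_context(pool, rag_pipeline, "what changed in Q3?")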

Cost tracking shows incorrect amounts because token pricing table is not updated after a model price change
Why: the hardcoded pricing dictionary in application code is not updated when providers change their pricing; the computed cost is wrong for months without anyone noticing.
Detect: calculated cost per call diverges from the provider's invoice; the delta matches the gap between hardcoded and current prices.
Fix: load pricing from an external source (LiteLLM's cost database, or a config file) rather than hardcoding; add a CI test that validates the pricing table against the provider's published API pricing page.
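
A sketch of the external-source approach using LiteLLM's bundled cost table (assumes the model name exists as a key in litellm.model_cost; entries are USD per token):

import litellm

def cost_from_litellm(model: str, input_tokens: int, output_tokens: int) -> float:
    entry = litellm.model_cost[model]  # KeyError for unknown models: fail loudly
    return (input_tokens * entry["input_cost_per_token"]
            + output_tokens * entry["output_cost_per_token"])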

LangSmith traces appear empty (no messages) for LangChain LCEL chains using .batch()
Why: .batch() runs chains in parallel using a thread pool; each thread gets a new LangSmith trace context, orphaning spans from the parent run.
Detect: individual LCEL steps show as separate root-level runs in LangSmith rather than children of the batch run.
Fix: pass config={"callbacks": parent_run_manager.get_child()} to .batch() calls to propagate the parent callback context; or use RunnableConfig to thread the run manager through.

Open Questions

  • Will the OTel GenAI semantic conventions stabilise at 1.0 in 2026, and will Anthropic SDK ship official OTel instrumentation?
  • How does trace sampling strategy affect cost attribution accuracy for high-traffic production systems?
  • Can Prometheus alerting on token budgets reliably catch agent runaway before it causes significant cost overruns?