LLM Tracing with OpenTelemetry
OTel GenAI semantic conventions, manual and auto-instrumentation for Anthropic/LangChain, Langfuse native SDK patterns, cost tracking per trace, and Prometheus alerting thresholds.
Distributed tracing for LLM systems. Every LLM call, retrieval, tool execution, and agent step should be a span. Without tracing you're flying blind. You can't debug latency, cost overruns, or quality regressions.
Why LLM Systems Need Tracing
A RAG pipeline has 5-10 steps: embed query → search vector DB → rerank → build prompt → LLM call → parse output. When something goes wrong (wrong answer, high latency, high cost), you need to know which step failed. Tracing makes that visible.
Key signals to capture:
- Latency: which step is slow? Is it the retrieval, the LLM, or parsing?
- Token counts: per-call input/output tokens for cost attribution
- Model and version: which model answered? Was it the right one?
- Retrieval quality: what was retrieved? Was it relevant?
- Errors: did any step fail? What was the error?
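To make the signals above concrete, here is a minimal standard-library sketch of what a trace records for each step: a name, a parent, a duration, and an error status. The `span` helper, the `finished` list, and `slowest_step` are hypothetical names used only for illustration; a real system would emit OpenTelemetry spans instead of appending to a list.

```python
import time
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical in-process recorder, not an OTel API.
_current = ContextVar("current_span", default=None)
finished = []  # completed span records, in finish order


@contextmanager
def span(name):
    """Record name, parent, duration, and error status for one pipeline step."""
    record = {"name": name, "parent": _current.get(), "error": None}
    token = _current.set(name)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = type(exc).__name__
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current.reset(token)
        finished.append(record)


def slowest_step(spans):
    """Return the name of the slowest non-root span."""
    children = [s for s in spans if s["parent"] is not None]
    return max(children, key=lambda s: s["duration_s"])["name"]


with span("rag_query"):
    with span("retrieve"):
        time.sleep(0.005)  # stand-in for a vector search
    with span("llm_call"):
        time.sleep(0.05)   # stand-in for the model call

print(slowest_step(finished))
```

Child spans finish before their parent, so the root record is appended last. The parent field is what lets a backend rebuild the step hierarchy for a single request.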
OpenTelemetry Semantic Conventions for LLMs
OTel added LLM-specific semantic conventions (GenAI conventions) in 2024. Key span attributes:
gen_ai.system = "anthropic" | "openai" | "cohere"
gen_ai.request.model = "claude-sonnet-4-6"
gen_ai.request.max_tokens = 1024
gen_ai.request.temperature = 0.7
gen_ai.response.model = "claude-sonnet-4-6" # actual model used
gen_ai.usage.input_tokens = 523
gen_ai.usage.output_tokens = 187
gen_ai.operation.name = "chat" | "text_completion" | "embeddings"
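One way to keep these attribute names consistent across call sites is to build them in a single helper. The sketch below uses only the attribute keys listed above; the function name `genai_span_attributes` and its parameters are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def genai_span_attributes(
    system: str,
    operation: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
    response_model: Optional[str] = None,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a flat dict of gen_ai.* attributes, omitting unset optional values."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.request.temperature": temperature,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # None is not a valid span attribute value, so drop anything unset.
    return {k: v for k, v in attrs.items() if v is not None}


attrs = genai_span_attributes(
    system="anthropic",
    operation="chat",
    request_model="claude-sonnet-4-6",
    input_tokens=523,
    output_tokens=187,
    max_tokens=1024,
)
print(attrs)
```

With a real OTel span, the resulting dict could be applied in one call with `span.set_attributes(attrs)`.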
Manual OTel Instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import anthropic
# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my_llm_app")
# Instrument a RAG pipeline
# vector_store, reranker, and build_prompt are assumed to be defined elsewhere
def rag_query(question: str) -> str:
    with tracer.start_as_current_span("rag_query") as root_span:
        root_span.set_attribute("question", question[:200])
        # Retrieval span
        with tracer.start_as_current_span("retrieve") as span:
            docs = vector_store.search(question, k=5)
            span.set_attribute("docs_retrieved", len(docs))
            span.set_attribute("query", question)
        # Reranking span
        with tracer.start_as_current_span("rerank") as span:
            docs = reranker.rerank(question, docs, top_n=3)
            span.set_attribute("docs_after_rerank", len(docs))
        # LLM call span
        with tracer.start_as_current_span("llm_call") as span:
            client = anthropic.Anthropic()
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=512,
                messages=[{"role": "user", "content": build_prompt(question, docs)}],
            )
            span.set_attribute("gen_ai.system", "anthropic")
            span.set_attribute("gen_ai.request.model", "claude-sonnet-4-6")
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
            answer = response.content[0].text
        root_span.set_attribute("answer_length", len(answer))
        return answer
Auto-Instrumentation
Libraries handle the instrumentation automatically:
# OpenLLMetry auto-instruments OpenAI, Anthropic, LangChain, and LlamaIndex
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
AnthropicInstrumentor().instrument()
# Now every anthropic.Anthropic().messages.create() call is automatically traced
# with all gen_ai.* attributes populated
Install the instrumentation packages you need:
pip install opentelemetry-instrumentation-anthropic
pip install opentelemetry-instrumentation-openai
pip install opentelemetry-instrumentation-langchain
Langfuse as OTel Backend
Langfuse accepts OTel traces and shows them as traces/spans in its UI:
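import base64
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# Langfuse ingests OTLP over HTTP; the key pair is sent as a Basic auth header.
# Check the endpoint path against the Langfuse docs for your deployment.
auth = base64.b64encode(b"pk-lf-...:sk-lf-...").decode()
# Reuses provider and BatchSpanProcessor from the setup above
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(
        endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
        headers={"Authorization": f"Basic {auth}"},
    ))
)
# All OTel spans now appear in Langfuse
Or use Langfuse's native SDK (more features):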
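# Decorator API as shipped in the v2 Python SDK; credentials are read from
# the LANGFUSE_* environment variables. call_llm and vector_store are assumed
# to be defined elsewhere.
from langfuse.decorators import observe, langfuse_context
@observe()  # nested under the caller's trace
def retrieve(question: str):
    docs = vector_store.search(question)
    langfuse_context.update_current_observation(metadata={"num_docs": len(docs)})
    return docs
@observe(as_type="generation")  # recorded as an LLM generation
def llm_call(question: str, docs):
    response = call_llm(question, docs)
    langfuse_context.update_current_observation(
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
        model="claude-sonnet-4-6",
    )
    return response
@observe()  # auto-creates a trace for each top-level call
def rag_pipeline(question: str):
    docs = retrieve(question)
    return llm_call(question, docs)
LangSmith Integration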
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my_rag_project"
# All LangChain calls now traced to LangSmith automatically
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
# Every call to llm.invoke() is traced
Cost Tracking
Track cost per trace to find expensive paths:
PRICING = {
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},  # per million tokens
    "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    "claude-opus-4-7": {"input": 5.0, "output": 25.0},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
# In your span
span.set_attribute("cost_usd", calculate_cost(model, in_tokens, out_tokens))
Alerting on Quality Signals
Set up alerts for:
- P99 latency > 10s (SLA breach)
- Error rate > 1% (model/API issues)
- Average tokens/call > threshold (prompt bloat)
- Cached token ratio < expected (prompt cache regression)
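Before wiring these into an alerting system, the four checks can be prototyped offline against a window of recorded calls. The sketch below is illustrative only: `CallRecord`, `evaluate_alerts`, and the default values for average tokens (4000) and cached ratio (0.5) are assumptions, since the list above leaves those thresholds open. It also assumes `input_tokens` includes the cached portion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallRecord:
    latency_s: float
    error: bool
    input_tokens: int         # total input tokens, cached ones included (assumption)
    cached_input_tokens: int  # portion of input_tokens served from the prompt cache


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def evaluate_alerts(calls, max_p99_s=10.0, max_error_rate=0.01,
                    max_avg_input_tokens=4000, min_cached_ratio=0.5):
    """Return the names of the alerts that would fire for this window of calls."""
    if not calls:
        return []
    total_input = sum(c.input_tokens for c in calls)
    cached = sum(c.cached_input_tokens for c in calls)
    fired = []
    if p99([c.latency_s for c in calls]) > max_p99_s:
        fired.append("p99_latency")
    if sum(c.error for c in calls) / len(calls) > max_error_rate:
        fired.append("error_rate")
    if total_input / len(calls) > max_avg_input_tokens:
        fired.append("avg_tokens")
    if total_input and cached / total_input < min_cached_ratio:
        fired.append("cached_ratio")
    return fired


# 98 healthy calls plus two slow, failing calls that missed the cache
window = [CallRecord(1.2, False, 1000, 800) for _ in range(98)]
window += [CallRecord(14.0, True, 1000, 0), CallRecord(12.5, True, 1000, 0)]
print(evaluate_alerts(window))
```

In production the same conditions would normally be expressed as alert rules over exported metrics rather than computed in application code.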
# Prometheus metrics alongside OTel traces
from prometheus_client import Counter, Histogram
llm_latency = Histogram("llm_call_seconds", "LLM call latency", ["model", "operation"])
llm_errors = Counter("llm_errors_total", "LLM errors", ["model", "error_type"])
llm_tokens = Counter("llm_tokens_total", "LLM tokens used", ["model", "type"])
# Instrument
with llm_latency.labels(model="claude-sonnet-4-6", operation="chat").time():
    response = client.messages.create(...)
llm_tokens.labels(model="claude-sonnet-4-6", type="input").inc(response.usage.input_tokens)
llm_tokens.labels(model="claude-sonnet-4-6", type="output").inc(response.usage.output_tokens)
Key Facts
- OTel GenAI conventions (2024): gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens are the standard span attributes
- opentelemetry-instrumentation-anthropic: auto-instruments all Anthropic SDK calls with gen_ai.* attributes
- Langfuse @observe() decorator: auto-creates a trace per function call with zero-boilerplate span creation
- LangSmith auto-tracing: set LANGCHAIN_TRACING_V2=true — every LangChain/LangGraph call is captured
- Alerting thresholds: P99 latency >10s, error rate >1%, cached token ratio below expected
- Cost tracking: Sonnet 4.6 $3/$15 per M, Haiku 4.5 $1/$5, Opus 4.7 $5/$25
Common Failure Cases
AnthropicInstrumentor().instrument() called after the anthropic.Anthropic() client is constructed, so no traces are captured
Why: OpenTelemetry instrumentation patches the SDK at import/construction time; calling instrument() after the client is already instantiated does not patch existing instances.
Detect: no spans appear in the tracing backend despite instrument() being called; adding a log line before the first LLM call shows the instrumentor registered, but spans are missing.
Fix: call AnthropicInstrumentor().instrument() before creating any anthropic.Anthropic() instances; place it at the top of your application entry point, before other imports that trigger client construction.
BatchSpanProcessor silently drops spans during high-throughput bursts
Why: BatchSpanProcessor has a fixed queue size (default 2048 spans); when the queue fills faster than the exporter can flush, spans are dropped without errors.
Detect: span count in the tracing backend is consistently lower than expected during load tests; no errors in application logs.
Fix: increase max_queue_size and max_export_batch_size in BatchSpanProcessor; or switch to SimpleSpanProcessor for lower-throughput applications where latency from synchronous export is acceptable.
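To get a feel for how queue size and batch size interact during a burst, a toy model helps. The sketch below is not BatchSpanProcessor code: `simulate_burst` is a hypothetical function that assumes a fixed number of spans arrive per tick and that the exporter drains exactly one batch per tick. Its parameter names mirror the processor's settings only for readability.

```python
def simulate_burst(spans_per_tick, ticks, max_queue_size=2048, max_export_batch_size=512):
    """Toy model of a bounded span queue.

    Each tick enqueues a burst, then the exporter drains one batch.
    Spans that arrive while the queue is full are counted as dropped.
    Returns (exported, dropped).
    """
    queued = exported = dropped = 0
    for _ in range(ticks):
        accepted = min(spans_per_tick, max_queue_size - queued)
        dropped += spans_per_tick - accepted
        queued += accepted
        batch = min(max_export_batch_size, queued)
        queued -= batch
        exported += batch
    return exported, dropped


default_result = simulate_burst(spans_per_tick=1000, ticks=10)
tuned_result = simulate_burst(
    spans_per_tick=1000, ticks=10, max_queue_size=8192, max_export_batch_size=1024
)
print(default_result, tuned_result)
```

In this model the default settings lose spans once the backlog reaches the queue limit, while the larger batch size keeps up with the arrival rate. In the SDK, the corresponding knobs are the `max_queue_size` and `max_export_batch_size` arguments to `BatchSpanProcessor`, where the batch size must not exceed the queue size.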
Langfuse @observe() decorator creates a new root trace for every nested function call instead of nesting spans
Why: @observe() creates a root trace when no parent context exists in the current thread; if the decorated function is called from a thread pool executor or background task, the parent trace context is not propagated.
Detect: Langfuse shows dozens of single-span traces instead of one nested trace per user request; the hierarchy is flat.
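Fix: propagate the caller's context explicitly to background tasks, for example by submitting each task through contextvars.copy_context().run so the worker thread sees the active trace. Note that langfuse_context.update_current_trace() only updates attributes on the trace that is already current, so on its own it does not re-parent orphan spans.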
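The context propagation step can be shown with the standard library alone. In this sketch, `current_trace_id` is a hypothetical stand-in for whatever "current trace" state the tracing library keeps in a context variable; the point is only that `contextvars.copy_context()` carries that state into a pool thread.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracing library's "current trace" state.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def traced_step(name):
    """Report which trace this step would attach to."""
    return name, current_trace_id.get()


def handle_request(trace_id):
    current_trace_id.set(trace_id)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Snapshot the caller's context and run the task inside it.
        # Take one snapshot per task: a Context cannot be entered by
        # two threads at the same time.
        ctx = contextvars.copy_context()
        future = pool.submit(ctx.run, traced_step, "retrieve")
        return future.result()


result = handle_request("trace-123")
print(result)
```

Submitting `traced_step` directly, without `ctx.run`, would typically leave the worker with the variable's default value, which is the orphaned-span symptom described above.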
Cost tracking shows incorrect amounts because token pricing table is not updated after a model price change
Why: the hardcoded pricing dictionary in application code is not updated when providers change their pricing; the computed cost is wrong for months without anyone noticing.
Detect: calculated cost per call diverges from the provider's invoice; the delta matches the gap between hardcoded and current prices.
Fix: load pricing from an external source (LiteLLM's cost database, or a config file) rather than hardcoding; add a CI test that validates the pricing table against the provider's published API pricing page.
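A minimal sketch of the config-driven approach follows, assuming prices are stored as JSON in the same per-million-token shape as the table in the Cost Tracking section. `PRICING_JSON` and `load_pricing` are hypothetical names, and the inline string stands in for a file or fetched table. Unlike a plain dictionary lookup, this variant returns `None` for an unknown model so a missing price shows up as a gap rather than a crash or a wrong number.

```python
import json
from typing import Optional

# Hypothetical config payload; in practice this might live in a file.
PRICING_JSON = """
{
  "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
  "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0}
}
"""


def load_pricing(raw: str) -> dict:
    """Parse per-million-token prices and reject malformed entries early."""
    pricing = json.loads(raw)
    for model, p in pricing.items():
        if not {"input", "output"} <= p.keys():
            raise ValueError(f"missing price fields for {model}")
        if p["input"] < 0 or p["output"] < 0:
            raise ValueError(f"negative price for {model}")
    return pricing


def calculate_cost(pricing: dict, model: str,
                   input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return cost in USD, or None when the model has no known price."""
    p = pricing.get(model)
    if p is None:
        return None  # leave the span attribute unset instead of guessing
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


pricing = load_pricing(PRICING_JSON)
cost = calculate_cost(pricing, "claude-sonnet-4-6", 523, 187)
unknown = calculate_cost(pricing, "some-unlisted-model", 523, 187)
print(cost, unknown)
```

The validation in `load_pricing` is also a natural place for the CI check: the test can load the same file and compare it against whatever reference table the team treats as authoritative.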
LangSmith traces appear empty (no messages) for LangChain LCEL chains using .batch()
Why: .batch() runs chains in parallel using a thread pool; each thread gets a new LangSmith trace context, orphaning spans from the parent run.
Detect: individual LCEL steps show as separate root-level runs in LangSmith rather than children of the batch run.
Fix: pass config={"callbacks": parent_run_manager.get_child()} to .batch() calls to propagate the parent callback context; or use RunnableConfig to thread the run manager through.
Connections
- observability/platforms — Langfuse, LangSmith, Arize Phoenix platform comparison
- python/ecosystem — structlog for structured logging alongside OTel traces
- evals/methodology — online evals that plug into the tracing pipeline
- agents/langgraph — agent step tracing in LangGraph
Open Questions
- Will the OTel GenAI semantic conventions stabilise at 1.0 in 2026, and will Anthropic SDK ship official OTel instrumentation?
- How does trace sampling strategy affect cost attribution accuracy for high-traffic production systems?
- Can Prometheus alerting on token budgets reliably catch agent runaway before it causes significant cost overruns?