LangGraph

LangGraph v1.0 is the production standard for stateful multi-agent orchestration in Python, offering fine-grained graph-based control with built-in checkpointing and human-in-the-loop support.

Graph-based agent runtime from LangChain. Went GA as v1.0 in October 2025 and became the default runtime for production multi-agent systems in Python.

[Source: Perplexity research, 2026-04-29]


Core Abstraction

LangGraph models agent execution as a directed graph:

  • Nodes — Python functions (or RunnableLambdas) that read from state and write back to state
  • Edges — routing logic; conditional edges branch based on state values
  • State — a TypedDict that flows through every node; the single source of truth for the graph run

A minimal tool-calling loop (call_llm and run_tools are stubs to fill in):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list
    tool_calls: list

def call_llm(state: AgentState) -> dict:
    """Call the model; return a partial state update."""
    ...

def run_tools(state: AgentState) -> dict:
    """Execute pending tool calls; return their results."""
    ...

def route_after_llm(state: AgentState) -> str:
    """Conditional edge: the returned key must match the mapping below."""
    return "tools" if state["tool_calls"] else "end"

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tools", run_tools)
graph.add_conditional_edges("llm", route_after_llm, {"tools": "tools", "end": END})
graph.add_edge("tools", "llm")
graph.set_entry_point("llm")
app = graph.compile()

Key Features

Checkpointing

Persistent state snapshots after every node execution. Enables:

  • Resumability — pick up mid-run after a crash
  • Time-travel debugging — replay any historical state
  • Human-in-the-loop — pause at interrupt_before / interrupt_after hooks, collect input, resume

Backends: MemorySaver (in-process, dev only), SqliteSaver, PostgresSaver (production).
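
The mechanism can be sketched without LangGraph itself: persist a snapshot of state after every node, keyed by thread ID, so a crashed run resumes from the last snapshot instead of restarting. A stdlib-only illustration (not the real checkpointer API; names are invented):

```python
import copy

class InMemoryCheckpointer:
    """Toy checkpointer: one snapshot per thread per completed node."""
    def __init__(self):
        self.snapshots = {}  # thread_id -> list of state snapshots

    def save(self, thread_id, state):
        self.snapshots.setdefault(thread_id, []).append(copy.deepcopy(state))

    def latest(self, thread_id):
        snaps = self.snapshots.get(thread_id)
        return copy.deepcopy(snaps[-1]) if snaps else None

def run(nodes, state, thread_id, cp):
    # Resume from the last checkpoint if one exists for this thread.
    resumed = cp.latest(thread_id)
    start = len(cp.snapshots.get(thread_id, []))
    state = resumed if resumed is not None else state
    for node in nodes[start:]:          # skip nodes already completed
        state = {**state, **node(state)}
        cp.save(thread_id, state)       # snapshot after every node
    return state
```

With a real backend (SqliteSaver, PostgresSaver) the snapshots survive process restarts, which is what makes resumability and time-travel debugging possible.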

Human-in-the-Loop

app = graph.compile(interrupt_before=["tools"])  # pause before tool execution

The graph halts, surfaces state to a human, and resumes when app.invoke(None, config) is called with the original thread ID.

Streaming

Four streaming modes: values (full state), updates (node deltas), messages (LLM token stream), custom (arbitrary events). Critical for production UX.
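
The difference between the values and updates modes can be shown with a stdlib sketch (not the real app.stream API): each node returns a delta; updates yields just that delta, values yields the accumulated state:

```python
def stream(nodes, state, mode="values"):
    """Toy illustration of two streaming modes over a linear node sequence."""
    for name, node in nodes:
        delta = node(state)           # node returns a partial state update
        state = {**state, **delta}    # merge the delta into the running state
        yield delta if mode == "updates" else dict(state)

nodes = [("llm", lambda s: {"draft": "hi"}),
         ("tools", lambda s: {"result": s["draft"].upper()})]
updates = list(stream(nodes, {}, mode="updates"))
values = list(stream(nodes, {}, mode="values"))
```

updates is the smaller payload (good for progress UIs); values always carries the full state (good for rendering the whole conversation).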


Multi-Agent Patterns

Supervisor

A central supervisor LLM routes tasks to specialist sub-agents. Each sub-agent is itself a compiled graph. The supervisor maintains a shared state and delegates based on tool/capability routing.
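
A stdlib sketch of the shape (in a real system the router is an LLM call and each specialist is a compiled graph; here both are plain functions with invented names):

```python
def supervisor(task: str) -> str:
    # Stand-in for an LLM routing decision based on task content.
    return "researcher" if "find" in task else "writer"

agents = {
    "researcher": lambda state: {**state, "notes": f"facts about {state['task']}"},
    "writer": lambda state: {**state, "draft": f"report on {state['task']}"},
}

def run_supervised(task: str) -> dict:
    state = {"task": task}
    chosen = supervisor(task)      # central router picks a specialist
    return agents[chosen](state)   # delegate; shared state flows through
```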

Swarm / Handoffs

Agents transfer control directly to each other using handoff tools. No central coordinator. The active agent changes by returning a Command(goto="agent_name").
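
The handoff loop can be sketched in plain Python: each agent returns the name of the next agent (standing in for Command(goto=...)), and a dispatcher loops until an agent returns None:

```python
def agent_a(state):
    state["log"].append("a")
    return "b"        # hand off to agent b

def agent_b(state):
    state["log"].append("b")
    return None       # done; no further handoff

def run_swarm(agents, start, state):
    current = start
    while current is not None:            # no central coordinator:
        current = agents[current](state)  # the active agent picks its successor
    return state

final = run_swarm({"a": agent_a, "b": agent_b}, "a", {"log": []})
```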

Sequential Chaining

Simple pipeline: Agent A → Agent B → Agent C. Each graph's output becomes the next graph's input. Useful for preprocessing → analysis → formatting pipelines.
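
As a sketch, the pipeline is just function composition; here plain functions (illustrative names) stand in for each compiled graph's invoke:

```python
from functools import reduce

preprocess = lambda text: text.strip().lower()
analyze    = lambda text: {"words": len(text.split())}
format_out = lambda result: f"word count: {result['words']}"

def chain(*stages):
    """Each stage's output becomes the next stage's input."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)

pipeline = chain(preprocess, analyze, format_out)
```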


LangGraph vs Alternatives

Runtime     | Model               | State               | Strength
LangGraph   | Graph (nodes/edges) | Typed TypedDict     | Fine-grained control, checkpointing
CrewAI      | Role-based crews    | Shared crew memory  | Fast to prototype, opinionated
AutoGen/AG2 | Event-driven actors | Per-agent memory    | Complex conversation topologies
Google ADK  | A2A protocol        | Google Cloud native | Interop with Vertex agents

LangGraph wins on control and observability. CrewAI wins on time-to-first-demo.


LangGraph Cloud

Managed runtime for LangGraph graphs. Features:

  • Persistent checkpointers (no self-hosted database needed)
  • Built-in streaming
  • Horizontal scaling
  • Studio UI (visual debugger, state inspector)

Integration with LangSmith

LangGraph traces are automatically sent to LangSmith when LANGCHAIN_TRACING_V2=true. Every node execution appears as a span. Essential for debugging complex multi-hop agent runs.
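
A typical setup, assuming the standard LangSmith environment variables (the API-key value is a placeholder; verify names against current LangSmith docs):

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"  # placeholder
export LANGCHAIN_PROJECT="my-agent"                  # optional: group traces by project
```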


Production Considerations

  • State size — grows unbounded in long runs; prune or summarise messages
  • Parallel nodes — fan out by adding multiple edges from one node; merge branch results with state reducers or a join node
  • Tool errors — catch inside tool nodes; don't let unhandled exceptions abort the graph
  • Context windows — in multi-agent setups, each sub-agent gets a fresh context; cross-agent memory requires explicit handoff
  • Cost gates — add an observability node that counts tokens and hard-stops at threshold
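
The last point can be sketched as a guard node (stdlib only; in a real graph the token counts would come from the model response's usage metadata, and the names here are invented):

```python
class BudgetExceeded(RuntimeError):
    pass

def make_cost_gate(max_tokens: int):
    def cost_gate(state: dict) -> dict:
        # Accumulate usage and hard-stop the run once the budget is blown.
        used = state.get("tokens_used", 0) + state.get("last_call_tokens", 0)
        if used > max_tokens:
            raise BudgetExceeded(f"{used} tokens > budget of {max_tokens}")
        return {"tokens_used": used, "last_call_tokens": 0}
    return cost_gate
```

Placed on the main loop (e.g. between the llm and tools nodes), the gate turns runaway agent cost into a loud, catchable failure.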

Key Facts

  • Went GA as v1.0 in October 2025; prior to that it was the de facto standard but pre-release
  • Core abstractions: Nodes (Python functions), Edges (routing logic), State (TypedDict — single source of truth)
  • Four streaming modes: values, updates, messages, custom
  • Checkpointing backends: MemorySaver (dev only), SqliteSaver, PostgresSaver (production)
  • Human-in-the-loop via interrupt_before / interrupt_after hooks — graph halts, resumes on app.invoke(None, config)
  • Three multi-agent patterns: Supervisor (central LLM router), Swarm/Handoffs (Command(goto=...)), Sequential Chaining
  • Traces auto-sent to LangSmith when LANGCHAIN_TRACING_V2=true; every node appears as a span
  • LangGraph wins vs alternatives on control and observability; CrewAI wins on time-to-first-demo

Common Failure Cases

Graph state grows unbounded, context window overflows on long runs
Why: messages are appended to state on every node but never pruned; after 50+ turns the full message history exceeds the model's context limit.
Detect: InvalidRequestError: prompt is too long or silent truncation; trace shows state size growing linearly with turn count.
Fix: add a summarisation node that compresses old messages; or use a message trimmer that keeps the last N turns plus the system prompt.
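
A minimal trimmer along those lines (assuming messages are dicts with a role key; keeps the system prompt plus the last N non-system messages):

```python
def trim_messages(messages: list[dict], keep_last: int) -> list[dict]:
    """Keep system messages plus the last keep_last other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```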

Conditional edge routes to a non-existent node
Why: a routing function returns a string key that doesn't match any registered node name (typo, or node was renamed).
Detect: KeyError or GraphValueError when the conditional edge fires; only manifests on the branch that wasn't exercised in testing.
Fix: use an enum or a constant for node names; write a test that exercises every conditional branch.
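
One way to do this, sketched with a str-backed Enum so node names live in one place and a typo in the routing function fails at definition time rather than on an untested branch:

```python
from enum import Enum

class Node(str, Enum):
    LLM = "llm"
    TOOLS = "tools"

def route_after_llm(state: dict) -> str:
    # Returning Node members instead of raw strings rules out typos.
    return Node.TOOLS if state.get("tool_calls") else "end"
```

The complementary test exercises every branch: one input with pending tool calls, one without.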

Checkpointer backend not set — resume fails after crash
Why: MemorySaver was used in development; in production the graph crashes and all state is lost; no resumption is possible.
Detect: agent restart begins from scratch instead of resuming mid-task; thread ID exists in the application but not in the checkpoint store.
Fix: use PostgresSaver in production with the thread ID stored by the caller; add a startup health check that validates the checkpoint backend is reachable.

Tool error inside a node aborts the whole graph
Why: an unhandled exception in a tool call propagates up through the node function and terminates the graph run.
Detect: graph terminates with a tool error instead of retrying or routing to an error-handling node.
Fix: wrap tool calls in try/except inside the node; return an error message in the state so the LLM can reason about the failure and retry.
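
A sketch of that node shape (the tool-call structure here is invented): each exception becomes a state entry the LLM can see and reason about on the next turn:

```python
def run_tools(state: dict) -> dict:
    results = []
    for call in state.get("tool_calls", []):
        try:
            results.append({"tool": call["name"],
                            "output": call["fn"](*call["args"])})
        except Exception as exc:  # never let a tool error abort the graph
            results.append({"tool": call["name"], "error": str(exc)})
    return {"tool_results": results, "tool_calls": []}
```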

Parallel node writes conflict on the same state key
Why: two nodes executing in parallel both write to the same state field; the second write silently overwrites the first.
Detect: data from one parallel branch disappears from state; add logging at the end of each parallel node to capture state snapshots.
Fix: give each parallel branch a distinct state key; merge the results in a join node after the parallel section.
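
A stdlib sketch of the fix (branch names invented): each branch writes its own key, and the join step asserts there is no overlap before merging:

```python
from concurrent.futures import ThreadPoolExecutor

def branch_search(state):
    return {"search_result": f"hits for {state['query']}"}

def branch_summary(state):
    return {"summary_result": f"summary of {state['query']}"}

def run_parallel(state, branches):
    with ThreadPoolExecutor() as pool:
        deltas = list(pool.map(lambda b: b(state), branches))
    merged = dict(state)
    for delta in deltas:  # join step: distinct keys, so no silent overwrite
        overlap = merged.keys() & delta.keys()
        assert not overlap, f"write conflict on {overlap}"
        merged.update(delta)
    return merged
```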

Connections

  • agents/langchain — LangChain is the base framework LangGraph is built on; use LangChain for simple chains, LangGraph for stateful workflows
  • agents/react-pattern — the ReAct loop (think, act, observe) is what individual LangGraph nodes implement
  • protocols/mcp — MCP tool servers connect to LangGraph as tool nodes; security considerations apply
  • observability/platforms — LangSmith is the native tracing target; Langfuse works via OTel
  • apis/anthropic-api — Claude is called from within graph nodes using the Anthropic Messages API
  • agents/multi-agent-patterns — covers CrewAI, AutoGen, and Swarm as alternatives to LangGraph
  • security/mcp-cves — MCP tools integrated into LangGraph nodes inherit MCP's attack surface

Open Questions

  • How does LangGraph v1.0 compare to the OpenAI Agents SDK (released early 2025) on production workloads — which wins on latency, cost, and developer ergonomics for Claude-based agents?
  • What are the practical limits of PostgresSaver checkpointing at scale — what checkpoint size and throughput does it support before becoming a bottleneck?
  • Does LangGraph Cloud's horizontal scaling model handle stateful graphs correctly when a single thread's state is large (e.g., multi-turn agent with large message history)?