How to Build Deterministic Agentic Workflows With AI Reasoning
AI agents are impressive in demos and chaotic in production. They skip steps, hallucinate tool calls, and take different paths every time you run them. The fix isn't removing AI reasoning; it's constraining where reasoning happens while keeping everything else deterministic.
This guide shows you how to build agentic workflows where the orchestration is rock-solid and predictable, but individual steps still leverage LLM intelligence for decisions that genuinely need it.
What You'll Need
- Python 3.10+ with asyncio support
- LangGraph (pip install langgraph) or a state machine library
- An LLM API: OpenAI, Anthropic, or a local model via Ollama
- Pydantic for structured output validation
- Basic understanding of state machines and directed graphs
The Core Problem: Agents vs. Workflows
Most teams building with LLMs fall into one of two traps:
Trap 1: Full autonomy. Hand the LLM a goal and let it figure everything out. Works for simple tasks. Falls apart on anything multi-step: the agent skips validation, invents tool calls, or loops forever.
Trap 2: Pure determinism. Hard-code every step in a DAG. No flexibility. The moment requirements change or input data varies, you're rewriting pipeline code.
The answer is a hybrid: deterministic orchestration with bounded AI reasoning.
┌─────────────────────────────────────────────────┐
│            Deterministic Orchestrator           │
│                                                 │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│   │ State A │────►│ State B │────►│ State C │   │
│   └────┬────┘     └────┬────┘     └────┬────┘   │
│        │               │               │        │
│   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐   │
│   │ LLM Call│     │  Tool   │     │ LLM Call│   │
│   │(bounded)│     │(determi-│     │(bounded)│   │
│   └─────────┘     │ nistic) │     └─────────┘   │
│                   └─────────┘                   │
└─────────────────────────────────────────────────┘
The orchestrator decides what happens next. The LLM decides how to do specific subtasks within strict boundaries.
Pattern 1: State Machine Orchestration
The most reliable pattern for deterministic agentic workflows is a state machine where each state represents a phase of work, and transitions follow explicit rules.
Why State Machines Beat Free-Form Agents
| Aspect | Free-Form Agent | State Machine Agent |
|---|---|---|
| Execution path | Different every run | Same states, same order |
| Debugging | Read the entire trace | Check which state failed |
| Recovery | Restart from scratch | Resume from last state |
| Testing | Mock the whole LLM | Test each state independently |
| Auditability | "The AI decided" | Explicit transition logs |
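The "resume from last state" row deserves emphasis, because it needs no framework at all: persist which phases have completed after each step, and skip them on restart. A minimal stdlib-only sketch (the phase functions and checkpoint path are illustrative, not from any library):

```python
import json
import tempfile
from pathlib import Path

# Illustrative checkpoint location; a real system would use durable storage
CHECKPOINT = Path(tempfile.gettempdir()) / "workflow_checkpoint.json"
CHECKPOINT.unlink(missing_ok=True)  # start fresh for this demo

def load_checkpoint() -> dict:
    # Resume from the last completed phase if a checkpoint exists
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": [], "data": {}}

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def run_workflow(phases: dict) -> dict:
    state = load_checkpoint()
    for name, fn in phases.items():
        if name in state["completed"]:
            continue  # already done in a previous run: skip, don't redo
        state["data"] = fn(state["data"])
        state["completed"].append(name)
        save_checkpoint(state)  # crash after this line loses nothing
    return state

# Illustrative phases standing in for plan/execute nodes
phases = {
    "plan": lambda d: {**d, "plan": ["step1", "step2"]},
    "execute": lambda d: {**d, "results": [s + ":ok" for s in d["plan"]]},
}
final = run_workflow(phases)
```

If the process dies mid-run, the next invocation replays only the phases that never checkpointed, which is exactly the recovery property the table claims for state machines.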
Implementation With LangGraph
LangGraph models workflows as directed graphs where nodes are processing steps and edges define flow control.
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal
from pydantic import BaseModel
# Define your workflow state
class WorkflowState(TypedDict):
task: str
plan: list[str]
results: list[dict]
status: str
error: str | None
# Define structured output for the planning step
class TaskPlan(BaseModel):
steps: list[str]
requires_review: bool
# Node 1: Plan the task (LLM reasoning, bounded)
# `llm` is assumed to be a LangChain chat model instance, e.g. ChatOpenAI()
def plan_task(state: WorkflowState) -> WorkflowState:
    response = llm.with_structured_output(TaskPlan).invoke(
f"Break this task into 3-5 concrete steps: {state['task']}"
)
return {
"plan": response.steps,
"status": "planned"
}
# Node 2: Execute each step (deterministic)
def execute_steps(state: WorkflowState) -> WorkflowState:
results = []
for step in state["plan"]:
result = execute_tool(step) # deterministic tool call
results.append({"step": step, "result": result})
return {"results": results, "status": "executed"}
# Node 3: Validate results (LLM reasoning, bounded)
class ValidationResult(BaseModel):
    passed: bool

def validate_results(state: WorkflowState) -> WorkflowState:
    validation = llm.with_structured_output(ValidationResult).invoke(
        f"Check these results for errors: {state['results']}"
    )
    return {"status": "validated" if validation.passed else "failed"}
# Routing function (deterministic)
def route_after_validation(state: WorkflowState) -> Literal["end", "plan_task"]:
if state["status"] == "validated":
return "end"
return "plan_task" # retry with new plan
# Build the graph
graph = StateGraph(WorkflowState)
graph.add_node("plan_task", plan_task)
graph.add_node("execute_steps", execute_steps)
graph.add_node("validate_results", validate_results)
graph.set_entry_point("plan_task")
graph.add_edge("plan_task", "execute_steps")
graph.add_edge("execute_steps", "validate_results")
graph.add_conditional_edges("validate_results", route_after_validation, {
"end": END,
"plan_task": "plan_task"
})
workflow = graph.compile()
The LLM contributes reasoning at two points: planning and validation. Everything else follows fixed paths.
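The control flow the graph enforces can be seen more clearly with the framework removed. The sketch below is a dependency-free toy version of the same loop, with both LLM calls stubbed out as fixed functions so only the orchestration logic remains (the stubs are illustrative, not real model calls):

```python
def plan_task(state: dict) -> dict:
    # Stubbed LLM planning call: returns a fixed plan
    return {**state, "plan": ["fetch", "transform"], "status": "planned"}

def execute_steps(state: dict) -> dict:
    # Deterministic execution of each planned step
    results = [{"step": s, "result": f"{s}:done"} for s in state["plan"]]
    return {**state, "results": results, "status": "executed"}

def validate_results(state: dict, attempt: int) -> dict:
    # Stubbed LLM validation: fails the first attempt, passes the second
    passed = attempt >= 2
    return {**state, "status": "validated" if passed else "failed"}

def run(task: str, max_attempts: int = 5) -> dict:
    state = {"task": task}
    for attempt in range(1, max_attempts + 1):  # hard cap on retries
        state = plan_task(state)
        state = execute_steps(state)
        state = validate_results(state, attempt)
        if state["status"] == "validated":
            return state
    raise RuntimeError("validation never passed within the retry cap")

final = run("summarize logs")
```

The retry edge from validation back to planning is an ordinary loop with a hard cap, which is the whole point: the LLM never decides the shape of this loop, only what happens inside each stub.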
Pattern 2: Structured Outputs as Guardrails
The single most important technique for deterministic agentic workflows: force every LLM response into a schema.
Without structured outputs, you're parsing free text and hoping. With them, you get validated, typed data every time.
from pydantic import BaseModel, Field
from enum import Enum
class ActionType(str, Enum):
SEARCH = "search"
CALCULATE = "calculate"
WRITE = "write"
DONE = "done"
class AgentDecision(BaseModel):
"""What the agent decides to do next."""
action: ActionType
parameters: dict = Field(description="Action-specific parameters")
reasoning: str = Field(max_length=200, description="Brief justification")
confidence: float = Field(ge=0.0, le=1.0)
# Force the LLM to respond in this exact format
decision = llm.with_structured_output(AgentDecision).invoke(
"Given the current state, what's the next action?"
)
# Now you can route deterministically
match decision.action:
case ActionType.SEARCH:
result = search_tool(**decision.parameters)
case ActionType.CALCULATE:
result = calculate(**decision.parameters)
case ActionType.WRITE:
result = write_output(**decision.parameters)
case ActionType.DONE:
return finalize(decision)
Use .with_structured_output() in LangChain, or response_format in the raw OpenAI API. Either gives you validated JSON matching your schema; no regex parsing needed.
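If your stack has neither helper, you can approximate the guardrail by hand: parse the raw response as JSON, check it against the fields you require, and retry on failure. A stdlib-only sketch, using a fake model function (`fake_llm` is illustrative) that returns garbage once before producing valid output:

```python
import json

# Required fields and their expected types (illustrative schema)
REQUIRED = {"action": str, "confidence": float}

def parse_decision(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped field: {key}")
    return data

def decide(fake_llm, max_retries: int = 2) -> dict:
    for attempt in range(max_retries + 1):  # bounded retries, then fail loudly
        try:
            return parse_decision(fake_llm())
        except (json.JSONDecodeError, ValueError):
            if attempt == max_retries:
                raise

# Fake model: bad output first, valid JSON on the retry
responses = iter(['not json', '{"action": "search", "confidence": 0.8}'])
decision = decide(lambda: next(responses))
```

This is strictly weaker than schema-enforced decoding, but it preserves the key property: downstream code only ever sees validated, typed data or a loud failure.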
The Confidence Gate Pattern
Add a confidence threshold to create automatic escalation:
if decision.confidence < 0.7:
# Low confidence: escalate to human or use fallback
result = human_review(decision)
elif decision.confidence < 0.9:
# Medium confidence: proceed but log for review
result = execute_with_logging(decision)
else:
# High confidence: execute directly
result = execute(decision)
This keeps the workflow deterministic while letting AI reasoning influence the path.
Pattern 3: Temporal's Deterministic Orchestration
For production systems that need crash recovery and durable execution, Temporal provides the strongest guarantees. Its key insight: separate the deterministic orchestration layer from non-deterministic LLM calls.
from temporalio import workflow, activity
from datetime import timedelta
@activity.defn
async def llm_decide_next_action(goal: str, history: list) -> dict:
    """Non-deterministic: LLM reasoning happens here."""
    # Structured output (AgentDecision from Pattern 2) keeps the response typed
    response = await llm.with_structured_output(AgentDecision).ainvoke(
        f"Goal: {goal}\nHistory: {history}\nWhat's the next action?"
    )
    return {"action": response.action, "params": response.parameters}
@activity.defn
async def execute_tool(action: str, params: dict) -> dict:
"""Deterministic: tool execution with fixed behavior."""
return tool_registry[action](**params)
@workflow.defn
class AgentWorkflow:
@workflow.run
async def run(self, goal: str) -> str:
history = []
for _ in range(10): # hard cap on iterations
# LLM decides (non-deterministic, but recorded)
decision = await workflow.execute_activity(
llm_decide_next_action,
args=[goal, history],
start_to_close_timeout=timedelta(seconds=30),
)
if decision["action"] == "done":
break
# Tool executes (deterministic)
result = await workflow.execute_activity(
execute_tool,
args=[decision["action"], decision["params"]],
start_to_close_timeout=timedelta(seconds=60),
)
history.append({"decision": decision, "result": result})
return format_output(history)
Why Temporal Matters for Agents
| Without Temporal | With Temporal |
|---|---|
| Agent crashes → restart from zero | Agent crashes → replays from event history |
| LLM called again → different answer | Recorded LLM response → same answer |
| No visibility into what happened | Full event history with every decision |
| Manual checkpointing code | Automatic durable execution |
Pattern 4: Supervisor + Specialists
For complex workflows requiring multiple domains of expertise, use a supervisor pattern: one routing agent that delegates to specialized workers.
class SupervisorRouter(BaseModel):
"""Supervisor's routing decision."""
specialist: Literal["data_analyst", "code_writer", "reviewer"]
task_description: str
required_output_format: str
def supervisor_node(state: WorkflowState) -> WorkflowState:
routing = llm.with_structured_output(SupervisorRouter).invoke(
f"Route this task to the right specialist: {state['task']}"
)
return {"next_specialist": routing.specialist, "sub_task": routing.task_description}
def data_analyst_node(state: WorkflowState) -> WorkflowState:
    # Narrow tools: only data queries and analysis
    result = analyst_llm.bind_tools([query_db, plot_chart]).invoke(state["sub_task"])
    return {"results": [result]}

def code_writer_node(state: WorkflowState) -> WorkflowState:
    # Narrow tools: only file operations
    result = coder_llm.bind_tools([read_file, write_file, run_tests]).invoke(state["sub_task"])
    return {"results": [result]}
Key Rules for Supervisor Architecture
- Narrow tool access: each specialist gets only the tools it needs
- Typed handoffs: use Pydantic models for inter-agent communication
- Budget limits: set token and time caps per specialist
- Auditable routing: log the supervisor's routing decision with reasoning
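Budget limits are simple to enforce in plain code. The sketch below tracks token spend and wall-clock time per specialist and refuses further calls once either cap is hit; the token counts are a stand-in for whatever usage metering your LLM client reports:

```python
import time

class Budget:
    """Per-specialist caps on tokens and wall-clock time."""

    def __init__(self, max_tokens: int, max_seconds: float):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.tokens_used = 0
        self.started = time.monotonic()

    def charge(self, tokens: int) -> None:
        # Record spend, then fail loudly if either cap is exceeded
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("time budget exceeded")

budget = Budget(max_tokens=1000, max_seconds=30.0)
budget.charge(400)       # fine
budget.charge(500)       # fine: 900 total
try:
    budget.charge(200)   # 1100 total: over the cap
    exceeded = False
except RuntimeError:
    exceeded = True
```

Wrapping each specialist's LLM client so every call passes through `charge()` turns a runaway specialist into a caught exception instead of a surprise invoice.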
Production Hardening: The Checklist
Building the workflow is half the job. Making it production-ready is the other half.
Three-Phase Execution (Plan → Validate → Execute)
Never let an agent execute actions directly. Always split into phases:
# Phase 1: Agent proposes an action
proposed_action = agent.plan(task)
# Phase 2: Validate against business rules (deterministic)
validation = validate_action(proposed_action)
if not validation.is_valid:
return {"error": validation.reason}
# Phase 3: Execute the validated action
result = execute(proposed_action)
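The validate_action step is where business rules live, in code rather than in a prompt. A minimal stdlib sketch with an action allowlist and one parameter check (the specific rules are illustrative):

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"search", "write"}  # illustrative allowlist

@dataclass
class Validation:
    is_valid: bool
    reason: str = ""

def validate_action(action: dict) -> Validation:
    # Rule 1: only allowlisted actions may run
    if action["name"] not in ALLOWED_ACTIONS:
        return Validation(False, f"action not allowed: {action['name']}")
    # Rule 2: destructive flags are rejected here, not in a system prompt
    if action.get("params", {}).get("overwrite"):
        return Validation(False, "overwrite is not permitted")
    return Validation(True)

ok = validate_action({"name": "search", "params": {"query": "logs"}})
blocked = validate_action({"name": "delete", "params": {}})
```

Because the gate is deterministic code, no amount of prompt drift or model update can sneak a disallowed action past it.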
Error Handling That Actually Works
| Problem | Bad Solution | Good Solution |
|---|---|---|
| LLM returns invalid JSON | Retry indefinitely | Structured outputs + 2 retries max |
| Agent loops forever | No loop limit | Hard cap on iterations (5-10) |
| Tool call fails | Swallow the error | Exponential backoff + dead letter queue |
| Low-confidence decision | Ignore confidence | Gate on confidence score, escalate if low |
| Agent hallucinates a tool | Let it try and fail | Allowlist tools per state |
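The tool-failure row combines three of those fixes: a retry cap, exponential backoff, and a dead letter queue. A stdlib sketch with tiny delays so it runs quickly (the failing tool and queue are illustrative):

```python
import time

dead_letter_queue = []  # illustrative stand-in for a real DLQ

def call_with_backoff(fn, payload, max_attempts=3, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return fn(payload)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of retries: park the payload for later inspection
                dead_letter_queue.append({"payload": payload, "error": str(exc)})
                return None
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))

def always_fails(payload):
    raise ConnectionError("tool unreachable")

result = call_with_backoff(always_fails, {"step": "fetch"})
```

The crucial part is the terminal branch: the payload lands somewhere inspectable instead of being retried forever or silently dropped.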
Observability Requirements
Every production agentic workflow needs:
- Correlation IDs across all steps for end-to-end tracing
- State transition logs: what state, what triggered the transition, what data changed
- LLM call logs: prompt, response, tokens used, latency
- Tool call logs: inputs, outputs, success/failure, duration
- Decision audit trail: why the agent chose action A over action B
import structlog
logger = structlog.get_logger()
def execute_with_tracing(state, action):
logger.info("state_transition",
correlation_id=state["correlation_id"],
from_state=state["current_state"],
action=action.action,
confidence=action.confidence,
reasoning=action.reasoning,
)
result = execute(action)
logger.info("action_result",
correlation_id=state["correlation_id"],
success=result.success,
duration_ms=result.duration_ms,
)
return result
Framework Comparison: Picking the Right Tool
| Framework | Determinism | Recovery | Learning Curve | Best For |
|---|---|---|---|---|
| LangGraph | ✅ Graph-based state machine | ✅ Checkpointing | 🟡 Medium | Most agentic workflows |
| Temporal | ✅ Durable execution | ✅ Full replay | 🔴 Steep | Mission-critical systems |
| LlamaIndex Workflows | ✅ Event-driven | ⚠️ Manual | 🟢 Easy | RAG-heavy pipelines |
| CrewAI | ⚠️ Sequential only | ❌ None | 🟢 Easy | Quick multi-agent prototypes |
| OpenAI Agents SDK | ⚠️ Basic handoffs | ❌ None | 🟢 Easy | Simple agent chains |
| Custom (asyncio) | ✅ You control it | ⚠️ Build it yourself | 🔴 Steep | Full control, no dependencies |
Anti-Patterns to Avoid
These mistakes kill agentic workflows in production:
- Unbounded autonomy: letting the LLM decide workflow structure, not just subtask content
- Mega-prompts: stuffing routing logic, tool descriptions, and business rules into one prompt
- Permission logic in prompts: "Don't call the delete API unless the user confirmed" belongs in code, not in a system prompt
- Missing loop limits: every agent loop needs a hard cap (5-10 iterations max)
- Tool sprawl: giving agents 50 tools when they need 5; more tools mean more hallucinated tool calls
- No evaluation harness: if you can't measure agent accuracy on golden test sets, you can't improve it
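Several of these anti-patterns share one cure: a per-state tool allowlist enforced in code. A stdlib sketch (the tool registry and state names are illustrative):

```python
# Illustrative tool registry and per-state allowlist
TOOLS = {
    "search": lambda q: f"results for {q}",
    "write": lambda text: f"wrote {len(text)} chars",
    "delete": lambda path: f"deleted {path}",
}
STATE_ALLOWLIST = {
    "research": {"search"},
    "drafting": {"search", "write"},
    # note: no state ever exposes "delete"
}

def call_tool(state: str, tool: str, arg: str) -> str:
    allowed = STATE_ALLOWLIST.get(state, set())
    if tool not in allowed:
        # Hallucinated or out-of-state tool calls die here, in code
        raise PermissionError(f"{tool!r} not allowed in state {state!r}")
    return TOOLS[tool](arg)

out = call_tool("research", "search", "agent patterns")
try:
    call_tool("research", "delete", "/tmp/x")
    blocked = False
except PermissionError:
    blocked = True
```

A hallucinated tool name, a tool called from the wrong state, and permission logic smuggled into a prompt all collapse into the same cheap membership check.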
What's Next
- Build a starter project: take the LangGraph example above and extend it with your own tools and states
- Add evaluation: create golden test sets and run regression tests on every prompt change
- Try Temporal: if you need crash recovery, Temporal's Python SDK is production-ready
- Read Anthropic's agent patterns: their building effective agents guide covers complementary patterns
- Explore test-driven agents: pair this with TDD for AI agents for even more reliability
For a hands-on guide to building AI-powered development workflows, check out the Claude Code Workflow Guide and AI Coding Agents Compared.