Codex, Claude, and AI Agents: the boring architecture that actually ships work
Codex vs Claude: not “which is smarter,” but “where the work happens”
Codex is positioned as a software engineering agent: it can work on tasks like writing features, fixing bugs, answering questions about a codebase, and producing reviewable changes — often in isolated environments that can be preloaded with your repository. The repo-first workflow is the key difference. It’s designed around diffs, not just text.
Claude’s developer story leans heavily into tool use: you define tools, Claude decides when to call them, and you feed tool results back through the Messages API. For messy, multi-step problems (docs plus tickets plus decisions plus writing), that API shape is extremely practical.
In real life:
- If the deliverable is “merge this change,” Codex-style workflows feel native.
- If the deliverable is “understand a messy situation and coordinate tools,” Claude-style workflows feel native.
Neither is universally better. They’re optimized for different battlegrounds.
What AI agents actually are
An agent is a system that repeatedly does this:
- Interpret the goal
- Decide the next action
- Call a tool (or produce output)
- Observe the result
- Repeat until done (or until you stop it)
This “reason and act” loop is basically what the ReAct pattern formalized: interleaving reasoning with actions, then updating based on tool results.
Modern APIs make the action part structured instead of vibes. With tool calling, you provide a schema for tools, the model emits structured arguments, and your app executes them. OpenAI and Anthropic expose different request shapes, but the architecture is the same.
The three agent patterns that consistently pay off
The way I think about agents is not “which framework,” but “which pattern reliably produces outcomes.”
First is the repo worker: scoped tasks inside a codebase with reviewable output. Think “junior dev who never sleeps, but needs review.” It’s good for refactors, test generation, migrations, and bugfixes with reproduction steps. It’s bad for vague product decisions (“make it feel premium”) and anything without a test harness or a clear acceptance check.
Second is the research and write worker: read a pile of information, call internal tools, and produce a coherent artifact. Specs, docs, client emails, strategy notes, and decision memos. This works best when you force grounding: search your docs first, cite tool output, and if you can’t find it, say so.
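The “force grounding” rule can be enforced in code rather than in the prompt. Here’s a minimal sketch: `searchDocs` is a stub standing in for a real index lookup, and the guard refuses to answer when it returns nothing.

```typescript
// Grounding guard: if doc search returns nothing, the agent says so
// instead of answering from memory.
type Doc = { id: string; snippet: string };

// Stub: imagine a real internal-docs search behind this.
async function searchDocs(query: string): Promise<Doc[]> {
  return [];
}

async function groundedAnswer(question: string): Promise<string> {
  const docs = await searchDocs(question);
  if (docs.length === 0) {
    // Explicit "not found" beats a confident guess.
    return "Not found in internal docs. Flagging for a human instead of guessing.";
  }
  // In the real flow, the model drafts an answer that must cite docs[i].id.
  return `Answer based on: ${docs.map((d) => d.id).join(", ")}`;
}
```

The point is that “say so if you can’t find it” becomes a code path, not a hope.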
Third is the ops automator: tools-first, model-second. This is the agent that pays rent because it moves data between systems. The model does extraction, classification, and decisions — not creative writing. Typical flows are things like inbound lead → extract fields → CRM entry → Slack ping → follow-up task, or support email → classify → fetch account → draft response → create ticket, or form submission → validate → DB write → invoice → onboarding mail. This often lives in n8n because iterating on integrations is fast and observable.
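The extraction step in those flows is the part worth hardening, because model output goes straight into other systems. A minimal sketch of the validation gate (the extracted object would come from a model call; the shape and field names here are illustrative):

```typescript
// Validate model-extracted fields before they touch the CRM.
type Lead = { name: string; email: string; company: string };

function validateLead(raw: unknown): Lead | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const ok =
    typeof r.name === "string" &&
    typeof r.email === "string" &&
    /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(r.email) &&
    typeof r.company === "string";
  if (!ok) return null; // anything that fails goes to a human queue, not the CRM
  return { name: r.name as string, email: r.email as string, company: r.company as string };
}

const parsed = validateLead({ name: "Ada", email: "ada@example.com", company: "Acme" });
```

The model extracts; deterministic code decides whether the extraction is usable.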
A minimal agent loop that won’t embarrass you
Here’s the shape I trust: small toolset, strict schemas, a hard iteration limit, and an explicit stop condition. The point is not to be clever. It’s to be predictable.
```typescript
type ToolDef = {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema
};

async function runAgent(goal: string) {
  const tools: ToolDef[] = [
    {
      name: "searchDocs",
      description: "Search internal docs and return top matches.",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
        additionalProperties: false
      }
    },
    {
      name: "createTicket",
      description: "Create a ticket with title, body, and priority.",
      parameters: {
        type: "object",
        properties: {
          title: { type: "string" },
          body: { type: "string" },
          priority: { type: "string", enum: ["low", "medium", "high"] }
        },
        required: ["title", "body", "priority"],
        additionalProperties: false
      }
    }
  ];

  const messages = [
    { role: "system", content: "Be reliable. Use tools when needed. Stop when the task is complete." },
    { role: "user", content: goal }
  ];

  // callModel and executeTool are adapters you write for your vendor's API;
  // the loop itself doesn't care which vendor is behind them.
  for (let step = 0; step < 6; step++) {
    const resp = await callModel({ messages, tools });
    if (resp.type === "final") return resp.text;
    if (resp.type === "tool_call") {
      const result = await executeTool(resp.toolName, resp.args);
      messages.push(resp.asMessage()); // record the model's tool call in history
      messages.push({ role: "tool", name: resp.toolName, content: JSON.stringify(result) });
      continue;
    }
    throw new Error("Unexpected model response"); // fail loudly, don't loop forever
  }
  return "Stopped: step limit reached. Returning partial output + next suggested action.";
}
```
That exact idea maps to both OpenAI tool calling and Anthropic tool use — different request shapes, same architecture.
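For concreteness, here is one tool definition shaped for each vendor’s request format. The field names follow the public OpenAI chat completions and Anthropic Messages APIs, but treat this as a sketch of the mapping, not a pinned contract:

```typescript
// One vendor-neutral tool definition...
const tool = {
  name: "searchDocs",
  description: "Search internal docs and return top matches.",
  schema: {
    type: "object",
    properties: { query: { type: "string" } },
    required: ["query"]
  }
};

// ...shaped for OpenAI chat completions: wrapped in { type: "function", function: {...} }
const openaiTool = {
  type: "function",
  function: { name: tool.name, description: tool.description, parameters: tool.schema }
};

// ...and for the Anthropic Messages API: a flat object with input_schema
const anthropicTool = {
  name: tool.name,
  description: tool.description,
  input_schema: tool.schema
};
```

A thin adapter layer that does this translation is usually all it takes to keep the loop itself vendor-agnostic.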
MCP: the “stop rewriting integrations” move
If you’ve ever built the same tool integration twice (once for each model or vendor), you already know the pain.
Model Context Protocol (MCP) is an open protocol for connecting LLM applications to external tools and data sources through a standardized interface. The important bit is not ideology. It’s interoperability. You expose tools through an MCP server, and different clients or models can connect without bespoke glue each time.
OpenAI’s “Using tools” guide explicitly mentions remote MCP servers as part of the tooling ecosystem, which is a strong signal of where this is going: tools become portable, models become interchangeable.
The failure modes nobody likes talking about
Agents fail in boring ways:
- Too much permission (an agent that can delete will eventually delete)
- No grounding (it answers without checking tools)
- No verification (code changes without tests are just faster bugs)
- No observability (if you can’t replay steps, you can’t debug)
- No budget strategy (one slow tool call can stall the whole run)
That’s why the boring agent wins: allowlist tools, strict schemas, timeouts and retries, logs, and human approval for anything irreversible.
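Most of that list collapses into one chokepoint: a guarded tool executor. A minimal sketch, with `searchDocs` as a stub handler standing in for a real integration:

```typescript
// Deny-by-default allowlist, per-call timeout, and an audit log.
type Handler = (args: unknown) => Promise<unknown>;

const handlers: Record<string, Handler> = {
  searchDocs: async (args) => ({ hits: [], query: args }) // stub implementation
};

const auditLog: Array<{ tool: string; ok: boolean }> = [];

async function guardedExecute(tool: string, args: unknown, timeoutMs = 10_000) {
  const handler = handlers[tool];
  if (!handler) throw new Error(`Tool not allowlisted: ${tool}`); // deny by default
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout: ${tool}`)), timeoutMs);
  });
  try {
    const result = await Promise.race([handler(args), timeout]);
    auditLog.push({ tool, ok: true });
    return result;
  } catch (err) {
    auditLog.push({ tool, ok: false }); // failed calls are logged, not swallowed
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Irreversible tools (delete, send, pay) get one more branch here: park the call and wait for human approval instead of executing.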
When I reach for what
If the job is “change the repo,” I want a Codex-style workflow that produces reviewable code output.
If the job is “understand, decide, and coordinate tools,” I want Claude-style tool use and multi-turn orchestration.
If the job is “move data between systems,” I want a tools-first setup (often n8n, sometimes custom code), with the model doing extraction and decisions, not driving the whole system.
And if I’m building more than one integration, MCP starts looking like the adult choice.
The professional conclusion
Agents aren’t a product category. They’re an architecture pattern.
If you treat them like engineering — tools, schemas, permissions, logs, tests — they’re absurdly useful. If you treat them like magic, they’re just a faster way to produce confident nonsense.
Codex, Claude, and agent orchestration frameworks are all valid picks. The only wrong move is picking based on hype instead of where the work actually happens.
The short version: agent = loop + tools + rules. Codex is ideal for repo-native code changes with review. Claude is strong at reasoning plus tool-heavy workflows. MCP becomes relevant as soon as you don’t want to rebuild integrations per model.