We build AI agent infrastructure for a living. So when we say "don't use AI agents for this," it's worth paying attention.
The pressure to deploy agents is real. Board decks demand AI strategy slides. Engineering managers feel the pull to modernize workflows. Vendors (including us) are incentivized to show you everything agents can do. What gets left out of most conversations is the honest accounting of what agents do poorly, where they introduce unnecessary risk, and when a simpler solution is the objectively better engineering choice.
This post is that accounting. It won't tell you agents are overhyped across the board—they're not. But it will give you a decision framework rooted in the actual failure modes we've observed across hundreds of production deployments, not in marketing material.
The Core Problem: Agents Are Probabilistic Systems
Before getting to specific anti-patterns, it helps to internalize one foundational fact: AI agents are probabilistic. Even with the best orchestration, governance, and prompt engineering, a language model will occasionally produce unexpected output. Given identical inputs, it may produce different outputs on different runs. Temperature settings, model updates, context window truncation, and tool call failures all introduce variance.
This isn't a flaw to be engineered away—it's an intrinsic property of the technology. The question isn't whether you can eliminate that variance; you can't. The question is whether your target use case can tolerate it.
Most use cases that are good fits for agents can tolerate moderate variance in exchange for the flexibility and generalization that LLMs provide. The use cases we'll cover below cannot—or at least cannot tolerate it without mechanisms that end up costing more than the agent is worth.
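To make the variance concrete, here is a minimal illustration using the OpenAI Python SDK (any provider SDK shows the same behavior); the model name and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Classify this support ticket as BILLING, TECHNICAL, or OTHER: ..."

# Identical input, three runs. With temperature > 0 the outputs can differ
# run to run; even at temperature=0, provider-side model updates and serving
# nondeterminism mean byte-identical outputs are not guaranteed.
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(response.choices[0].message.content)
```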
Anti-Pattern 1: Deterministic Tasks with Computable Answers
The most common over-engineering mistake we see is wrapping a rule-based or computationally deterministic task in an LLM call.
Consider invoice processing. A company has invoices arriving in a structured XML format from a handful of known vendors. Field positions are consistent. Validation rules are well-defined: amounts must match purchase orders within a 2% tolerance, vendor IDs must exist in the supplier database, payment terms must be net 30 or net 60. If any condition fails, route to accounts payable for review.
This is a job for a state machine and a few database queries. An agent adds:
- Latency: LLM inference is measured in seconds, not milliseconds. A rule engine is measured in microseconds.
- Cost: Even at commodity model pricing, processing 50,000 invoices per month through an LLM incurs non-trivial token costs.
- Variance: The agent might occasionally misread a field it's parsed correctly a thousand times before.
- Debuggability: When an invoice routes incorrectly, "the model made an unexpected decision" is a far harder audit trail than "rule 4b failed because vendor_id was null."
```python
# Don't do this for structured, rule-based validation
def validate_invoice_with_agent(invoice_xml: str) -> ValidationResult:
    response = llm.complete(
        f"Validate this invoice and check if it should be approved: {invoice_xml}"
    )
    return parse_agent_response(response)

# Do this instead
def validate_invoice(invoice: Invoice) -> ValidationResult:
    errors = []
    if not supplier_db.exists(invoice.vendor_id):
        errors.append(ValidationError("UNKNOWN_VENDOR", invoice.vendor_id))
    po = purchase_order_db.get(invoice.po_number)
    if po and abs(invoice.amount - po.amount) / po.amount > 0.02:
        errors.append(ValidationError("AMOUNT_MISMATCH",
                                      expected=po.amount,
                                      actual=invoice.amount))
    if invoice.payment_terms not in ALLOWED_PAYMENT_TERMS:
        errors.append(ValidationError("INVALID_TERMS", invoice.payment_terms))
    return ValidationResult(approved=len(errors) == 0, errors=errors)
```
The heuristic here is simple: if you can write the decision logic as explicit conditionals without losing meaningful coverage of real-world cases, write the conditionals.
Where agents do add value in document processing is at the edges: unstructured invoice formats from new vendors, handwritten or scanned documents that resist OCR, or classification tasks where the taxonomy is large and evolving. The structured core stays deterministic; agents handle the genuinely ambiguous periphery.
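In code, that split can be a few lines of routing. A sketch, where `is_structured_format`, `parse_structured_invoice`, `agent_extract_invoice`, and `with_review_required` are hypothetical helpers wrapped around the `validate_invoice` rules above:

```python
def process_invoice(raw_document: bytes) -> ValidationResult:
    if is_structured_format(raw_document):
        # Known vendor, known schema: pure rules, no LLM in the loop.
        return validate_invoice(parse_structured_invoice(raw_document))

    # Unknown format, scanned image, new vendor: let the agent extract
    # fields, then run the result through the same deterministic rules.
    invoice = agent_extract_invoice(raw_document)
    result = validate_invoice(invoice)
    # Agent-extracted data always carries a human-review flag.
    return result.with_review_required()
```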
Anti-Pattern 2: Latency-Sensitive User-Facing Workflows
If a human is waiting synchronously for a response and that response must arrive in under 500 milliseconds to meet your SLA or UX expectations, agents are the wrong tool—at least in their current form.
This catches teams off guard because prototypes often feel fast. Local testing with a warmed model and short context windows doesn't reflect production conditions: cold-start latency, network round trips to provider APIs, tool calls that each add another 200–800ms, and the compounding effect of multi-step reasoning chains.
Consider a real-time product recommendation system on a high-traffic e-commerce platform. The engineering team wants to replace their collaborative filtering model with an agent that can incorporate natural language reasoning about why a product fits a customer's expressed preferences. The prototype is impressive in demos. In production:
- P50 latency goes from 45ms to 1,400ms
- P99 latency hits 4,200ms
- During a traffic spike, provider API rate limits cause cascading timeouts
- The recommendation quality improvement doesn't justify a 30x latency regression
The right architecture separates the fast path from the slow path. Run the agent offline or asynchronously to generate enriched product metadata, personalization signals, or segment-level recommendation rationale. Serve the actual recommendations from a low-latency retrieval system populated by those enriched signals.
```yaml
# Architecture: async enrichment, sync serving
enrichment_pipeline:
  trigger: product_catalog_update | user_segment_refresh
  agent_tasks:
    - generate_product_semantic_tags
    - compute_cross_sell_reasoning_embeddings
    - update_personalization_profiles
  target_latency: "< 30 seconds (async, background)"
  output: enriched_feature_store

recommendation_service:
  type: synchronous
  implementation: vector_similarity + collaborative_filter
  data_source: enriched_feature_store
  target_latency: "< 50ms p99"
  llm_involvement: none
```
The pattern generalizes: agents are well-suited for pre-computation, post-processing, and asynchronous enrichment. They struggle when they're in the critical path of a synchronous request with tight latency budgets.
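To make the separation concrete, here is a sketch of the serving side, with hypothetical `feature_store` and `vector_index` interfaces standing in for your infrastructure:

```python
def recommend(user_id: str, k: int = 10) -> list[str]:
    # The request handler reads only precomputed signals; no LLM call
    # can appear on this code path, so its latency budget is safe.
    profile = feature_store.get_user_profile(user_id)
    if profile is None:
        # Graceful degradation: fall back to popularity rankings if the
        # enrichment pipeline hasn't covered this user yet.
        return feature_store.get_popular_products(k)
    # Vector similarity over agent-enriched embeddings computed offline.
    return vector_index.nearest(profile.embedding, k=k)
```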
Anti-Pattern 3: High-Accountability Domains Without Adequate Human-in-the-Loop Design
Agents are being piloted in healthcare diagnosis support, legal document drafting, financial advice generation, and similar high-stakes domains. There's legitimate value to unlock in all of these areas—but teams frequently underestimate what "high-accountability" actually demands from their architecture.
The issue isn't that agents can't contribute meaningfully in these domains. They can. The issue is that organizations sometimes deploy agents in accountability structures that treat them as decision-makers rather than decision-support tools, then discover they've created liability they didn't anticipate.
Specific failure modes we've observed:
Attribution collapse: A multi-agent pipeline produces a medication interaction warning. Which agent produced it? Which model version? What context was it working from? Which tool calls informed it? If you can't answer these questions in under five minutes from an audit log, you have an accountability gap that regulators and plaintiffs will find.
Confidence miscalibration: LLMs produce confident-sounding output regardless of their actual epistemic state. An agent summarizing a contract may state "this agreement does not include a limitation of liability clause" when in fact it read right past the clause on page 34. Without ground-truth verification, that confident error becomes a costly one.
Approval theater: Teams add human-in-the-loop review steps but design them so that reviewers see only the agent's conclusion, not the reasoning chain, tool outputs, or the source documents the conclusion was drawn from. The human approves in 4 seconds because they have no real basis to do otherwise. This is compliance theater, not actual oversight.
If you're operating in a high-accountability domain, the decision isn't binary between "use agents" and "don't use agents." It's a question of where in the workflow agent output requires structured verification before it can influence a consequential decision. That verification architecture needs to be designed before the agent, not bolted on after.
Minimum viable accountability infrastructure for high-stakes domains:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentDecisionRecord:
    decision_id: str
    agent_id: str
    model_version: str
    timestamp: datetime
    input_context: dict           # Full context, not a summary
    tool_calls: list[ToolCall]    # Every external call made; ToolCall is your tool-layer type
    reasoning_trace: str          # Chain-of-thought, if available
    output: str
    confidence_indicators: dict   # Token probabilities, self-assessed uncertainty
    human_reviewer_id: str | None
    review_timestamp: datetime | None
    review_outcome: str | None    # Approved / Modified / Rejected
    review_notes: str | None
    audit_hash: str               # Tamper-evident hash of the full record
```
If your current infrastructure can't populate this record for every consequential agent action, you're not ready for high-accountability deployment.
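The `audit_hash` field is the easiest part to get right. One minimal approach, using only the standard library: serialize each record deterministically and chain its hash to the previous record's, so editing or deleting any entry invalidates every hash after it.

```python
import hashlib
import json

def compute_audit_hash(record: dict, previous_hash: str) -> str:
    # Deterministic serialization: sorted keys and fixed separators, so
    # the same record always produces the same bytes.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    # Chaining means verification is a linear scan: recompute each hash
    # from the stored records and compare against what was written.
    return hashlib.sha256((previous_hash + canonical).encode()).hexdigest()
```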
Anti-Pattern 4: Tasks Where Consistency Is More Valuable Than Capability
There's a class of workflow where what you need isn't a smart, generalizing system—it's a system that does exactly the same thing every time, with zero drift.
SOC 2 compliance evidence collection is a good example. Every quarter, your team needs to pull the same set of artifacts from the same systems, apply the same classification logic, and produce documentation in a format your auditors have approved. There is no ambiguity in this task. There is no edge case that benefits from creative reasoning. What matters is that artifact #47 is always retrieved the same way, documented the same way, and traceable to the same process.
An agent introduces model drift (model providers update underlying weights), prompt sensitivity (a slightly different phrasing of a system prompt may produce slightly different output), and variance that creates audit friction rather than reducing it. A deterministic script checked into version control, with its own audit log, is a better tool for this job by every measure.
The meta-heuristic: if task consistency is a first-class requirement, and if consistency means "identical process, not just identical outcome," prefer deterministic automation.
This also applies to many financial reconciliation workflows, regulatory reporting pipelines, and any process that will itself be audited. Auditors can validate a script. Validating the emergent behavior of an LLM across thousands of documents is a fundamentally different—and much harder—challenge.
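For contrast with the agent approach, a deterministic collector can be very small. A sketch, assuming a version-controlled `ARTIFACT_MANIFEST` and a hypothetical `fetch_artifact` helper:

```python
import json
import os
from datetime import datetime, timezone

def collect_evidence(quarter: str) -> None:
    os.makedirs(f"evidence/{quarter}", exist_ok=True)
    with open(f"evidence/{quarter}/audit_log.jsonl", "a") as log:
        for artifact in ARTIFACT_MANIFEST:  # fixed list, checked into git
            content = fetch_artifact(artifact["system"], artifact["query"])
            path = f"evidence/{quarter}/{artifact['id']}.json"
            with open(path, "w") as out:
                json.dump(content, out, indent=2, sort_keys=True)
            # Append-only log: what was pulled, from where, and when.
            log.write(json.dumps({
                "artifact_id": artifact["id"],
                "source_system": artifact["system"],
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
                "destination": path,
            }) + "\n")
```

Same inputs, same process, same output layout, every quarter. An auditor can read the script in an afternoon.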
Anti-Pattern 5: Early-Stage Problems That Aren't Understood Yet
There's a specific failure mode that hits ambitious teams: deploying agents to automate a process before the process itself is well-understood or stable.
When an engineering team is still discovering edge cases in their customer escalation workflow—still learning what information is needed to resolve certain ticket categories, still negotiating what "resolved" means across different product areas—an agent deployed into that workflow will encode and amplify the current confusion. It will optimize for a local understanding of the problem that may be materially wrong in ways that take months to surface.
The rule of thumb from lean manufacturing applies: don't automate a mess. First, run the process manually long enough to achieve stable understanding of:
- What inputs are genuinely sufficient to make a good decision
- What the distribution of edge cases looks like in production
- Where human judgment currently adds the most value
- What the failure modes of poor decisions look like, and their downstream costs
Once you have 90 days of production data and a documented, stable process, you have enough to design an agent workflow that actually matches the problem shape. Before that point, you're building a system optimized for an incomplete mental model.
This is also why we recommend against agents for new product features in their first few weeks of operation. The feedback loops needed to tune agent behavior don't exist until real users interact with the feature at scale.
Anti-Pattern 6: When the Integration Surface Is the Real Problem
A significant fraction of "we need an agent for this" requests are actually "we need better APIs and data infrastructure for this" requests in disguise.
A team wants an agent to pull data from their CRM, cross-reference it with their billing system, and produce renewal risk scores. The framing is "AI-powered churn prediction." The reality is that the CRM and billing system have incompatible data models, no documented API for bulk export, and fields whose meaning has drifted over three years of schema evolution.
An agent doesn't fix this. The agent will successfully retrieve data from both systems and produce renewal risk scores that are plausibly worded but analytically meaningless because the underlying data has not been harmonized. The team ships something that looks like a solution, and they discover three months later that the risk scores correlate poorly with actual churn.
The correct diagnosis is a data infrastructure problem: schema documentation, an ETL layer that enforces semantic consistency, and a data model that actually represents the business domain. Once that infrastructure exists, a much simpler predictive model (or, yes, an agent if the task genuinely benefits from language understanding) can operate on data that means what it appears to mean.
Before architecting an agent solution, ask: would a junior data engineer, given clean, well-documented data, be able to write a straightforward program to solve this? If yes, fix the data first.
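What "fix the data first" looks like in miniature: a single mapping layer that pins down what each field means before any model sees it. The field names here are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CustomerAccount:
    account_id: str
    annual_contract_value: float  # always USD, always annualized
    renewal_date: date            # contract end date, not last invoice date
    support_tickets_90d: int

def harmonize(crm_row: dict, billing_row: dict) -> CustomerAccount:
    return CustomerAccount(
        account_id=crm_row["account_id"],
        # Billing stores monthly amounts in cents; normalize once, here,
        # instead of letting every downstream consumer guess.
        annual_contract_value=billing_row["monthly_amount_cents"] * 12 / 100,
        renewal_date=date.fromisoformat(crm_row["contract_end"]),
        support_tickets_90d=crm_row.get("tickets_90d", 0),
    )
```

Once every consumer reads from a layer like this, the renewal-risk model (agent-based or not) is at least reasoning about data that means what it appears to mean.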
The Decision Framework: A Practical Checklist
Use this checklist before committing to an agent-based architecture. If you're answering "no" or "uncertain" to several of these, treat it as a signal to reconsider the approach or the scope.
Ambiguity and Reasoning Requirements
- Does the task involve genuine ambiguity that resists explicit rules? (natural language understanding, context-dependent judgment, open-ended synthesis)
- Does the problem space change frequently enough that hard-coded logic would require constant maintenance?
- Is generalization across similar-but-not-identical inputs genuinely valuable, not just a nice-to-have?
Latency and Reliability
- Can the workflow tolerate p99 latencies in the 1–10 second range?
- Is the task asynchronous or can it be made asynchronous without degrading user experience?
- Can the system degrade gracefully if the LLM provider experiences an outage or rate limit?
Accountability and Auditability
- Can you log the full context, tool calls, and reasoning trace for every agent action?
- Is the human-in-the-loop review step, if required, designed to give reviewers enough information to make a meaningful judgment?
- Is the agent's output decision support (informing a human decision) rather than a decision itself, unless you have explicitly validated that direct action is safe?
Process Maturity
- Is the underlying process stable and well-understood, with documented edge cases?
- Do you have at least 60–90 days of production data on how the task is currently handled?
- Have you mapped the downstream consequences of agent errors, including the P95 and P99 worst cases?
Infrastructure Readiness
- Is the data the agent will work with clean, semantically consistent, and well-documented?
- Would a simpler automation—a script, a rule engine, a traditional ML model—fail to meet the requirements? (If a simpler tool would work, use it.)
If you pass this checklist, agents are likely the right architecture. If you're failing more than two or three items, you probably have prerequisite work to do before agent deployment becomes the right call.
What to Do Instead
When agents aren't the right fit, the alternatives aren't "give up on automation":
Deterministic scripting and rule engines for well-defined, consistent workflows. Still automating. Still reducing manual work. Just not probabilistic.
Traditional ML models for prediction tasks with structured features and stable label definitions. Gradient-boosted trees still outperform LLMs on many tabular prediction problems, run 100x faster, and are far more interpretable.
Retrieval-augmented search for knowledge access problems where users need to find and read information, not have it synthesized and acted upon. Often a well-indexed document store with good semantic search does the job with far less operational complexity.
Workflow automation platforms (yes, including the classic ones) for process orchestration that is fundamentally about moving structured data between systems according to defined rules.
Agents with scoped, well-bounded tasks rather than broad, open-ended mandates, when you do use them. An agent that does one clearly defined thing well is dramatically more reliable than an agent tasked with "handle our customer support workflow end-to-end."
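As one illustration of scoping, here is a sketch of an agent task narrowed to a single classification with a deterministic guardrail on the output; the `llm` client and prompt are hypothetical:

```python
ALLOWED_LABELS = {"BILLING", "TECHNICAL", "ACCOUNT", "OTHER"}

def classify_ticket(ticket_text: str) -> str:
    label = llm.complete(
        "Classify this support ticket. Respond with exactly one of "
        f"{sorted(ALLOWED_LABELS)}.\n\n{ticket_text}"
    ).strip().upper()
    # The agent proposes; deterministic code disposes. Anything outside
    # the taxonomy routes to a human queue instead of triggering action.
    return label if label in ALLOWED_LABELS else "NEEDS_HUMAN_REVIEW"
```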
Conclusion
The maturity of the AI agent ecosystem is real, and the use cases where agents genuinely outperform alternatives are growing. But the social pressure to deploy agents everywhere is outpacing the engineering rigor needed to deploy them well.
The most expensive mistakes we see aren't failed agent experiments—those are recoverable. The expensive mistakes are agents deployed into production workflows that should have been rule engines, agents making decisions in accountability structures that weren't designed to support autonomous AI action, and agents built on top of data infrastructure that was never ready to support reliable automation of any kind.
Use agents when ambiguity, generalization, and language understanding are actually required by your problem. Reach for simpler tools when they're sufficient. The engineering discipline of matching tool to problem—resisting the pull toward the sophisticated solution when the simple one works—is the same discipline that's always separated good systems from expensive ones.
The teams getting the most value from AI agents right now are not the ones deploying agents most aggressively. They're the ones who deployed agents in the right places, understood why they were doing it, and built the governance infrastructure to know when things go wrong.