AI Operations Layer over ERP: 90-Day Implementation Roadmap

How to Wire an AI Operations Layer Over Existing ERP: A 90-Day Roadmap

TL;DR

You do not need to replace your ERP to add AI to your operations. The AI Operations Layer pattern sits on top of Logo, SAP, Microsoft Dynamics, Salesforce, HubSpot and Outlook without replacing or re-platforming them: it reads the data those systems already hold, produces AI-driven operational decisions, and writes the resulting actions back into the source systems behind a human approval gate. This guide walks through the 90-day implementation roadmap week by week: discovery in weeks 1-2, foundation in weeks 3-4, first workflow in weeks 5-8, and second workflow in weeks 9-12, with operational handover (runbook and outcome report) in weeks 11-12. It also names the four architectural decisions that most predict success, the six roles you actually need on the team, and the eight checkpoints that separate a program that ships from one that stalls.

Why this roadmap looks different from typical AI implementation guides

Most published "AI implementation roadmaps" treat AI as an infrastructure project — data warehouse first, model training second, deployment third. This works when the goal is to build a machine learning model. It does not work when the goal is to make Monday morning operations decisions faster.

The AI Operations Layer pattern, which we defined in our What Is the AI Operations Layer guide, inverts the traditional order. It starts with the operational decision that needs to move faster, works backward to the data needed to make that decision, connects to the systems already holding that data, adds the AI reasoning layer on top, and orchestrates the action back into the source system where the operator already works. The infrastructure is a consequence of the operational goal, not the starting point.

This roadmap is built around that inversion. Each week has an operational deliverable, not a technical milestone.

The four architectural decisions you must make in the first two weeks

Everything downstream is easier or harder depending on how you answer these:

1. Which operational workflow does the layer target first? Not "AI in general": a specific workflow with a named owner and a measurable outcome metric. Late-receivables collections, at-risk-customer alerts, stock-depletion prevention, and quote-to-cash cycle-time compression are the four most common first workflows. Pick the one where the pain is loudest and the outcome is measurable in dollars or hours.

2. Which model routing pattern will you use? The healthy default is model-agnostic — a thin abstraction that lets different agent steps route to different LLM providers (OpenAI, Anthropic, Google, open-weight) based on cost, latency, or capability. Locking into one provider on day one is a mistake you pay for on day 180.

3. What is the write-back approval gate? For the first 90 days, the layer should never take an action in a source system without a human clicking "send" or "approve." The AI drafts the message, opens the task, prepares the credit-hold recommendation — a human owns the trigger. Full autonomy is a phase-two decision, not a launch decision.

4. Where does the institutional memory live? Every action the layer takes, every human approval, every outcome — logged and structured for later evaluation. Postgres + a lightweight event schema is enough for the first 90 days. Fancy vector databases are optional at this stage.
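As a rough illustration of the "lightweight event schema" in decision 4, the sketch below logs decisions and approvals to one append-only table. The table name `layer_events`, its columns, the event kinds, and the `log_event` helper are all assumptions made for this example, and `sqlite3` stands in for Postgres only so the code runs without a database server.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one append-only table, one row per event.
SCHEMA = """
CREATE TABLE IF NOT EXISTS layer_events (
    id          INTEGER PRIMARY KEY,
    occurred_at TEXT NOT NULL,   -- ISO-8601 UTC timestamp
    workflow    TEXT NOT NULL,   -- for example 'collections'
    kind        TEXT NOT NULL,   -- decision, approval, rejection or outcome
    entity_id   TEXT NOT NULL,   -- unified customer id
    actor       TEXT NOT NULL,   -- 'layer' or the approving user
    payload     TEXT NOT NULL    -- JSON details kept for later evaluation
)
"""

def log_event(conn, workflow, kind, entity_id, actor, payload):
    """Append one event and return its row id."""
    cur = conn.execute(
        "INSERT INTO layer_events "
        "(occurred_at, workflow, kind, entity_id, actor, payload) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), workflow, kind,
         entity_id, actor, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid

# In-memory database as a stand-in for the real Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
log_event(conn, "collections", "decision", "cust-001", "layer",
          {"reason": "payment slowdown"})
log_event(conn, "collections", "approval", "cust-001", "a.owner",
          {"action": "send_follow_up"})
kinds = [row[0] for row in conn.execute("SELECT kind FROM layer_events ORDER BY id")]
```

Keeping the details in a JSON payload lets new workflows log extra fields without a schema migration, which is one reason a single table can be enough early on.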

Answer these four in the first two weeks and the 90-day roadmap becomes execution. Skip them and you will be renegotiating scope in week eight.
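One way to picture the model-agnostic routing from decision 2 is a small lookup from agent step to provider. Everything named here (`Route`, `ROUTES`, `complete`, the `provider_a` and `provider_b` keys, and the model labels) is a placeholder for illustration; the providers are offline stand-ins, not real SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Route:
    provider: str  # hypothetical provider key
    model: str     # placeholder label, not a real model id

# A provider is any callable (model, prompt) -> text, so swapping one is a config change.
ProviderFn = Callable[[str, str], str]

def fake_provider(name: str) -> ProviderFn:
    """Offline stand-in for a vendor SDK call; returns a tagged echo."""
    return lambda model, prompt: f"[{name}/{model}] {prompt}"

PROVIDERS: Dict[str, ProviderFn] = {
    "provider_a": fake_provider("provider_a"),
    "provider_b": fake_provider("provider_b"),
}

# Routing policy: each agent step names the provider and model it should use.
ROUTES: Dict[str, Route] = {
    "classify_risk": Route("provider_a", "small-fast"),
    "draft_email": Route("provider_b", "large-capable"),
}
DEFAULT_ROUTE = Route("provider_a", "small-fast")

def complete(step: str, prompt: str) -> str:
    """Send the prompt to whichever provider the policy assigns to this step."""
    route = ROUTES.get(step, DEFAULT_ROUTE)
    return PROVIDERS[route.provider](route.model, prompt)

print(complete("draft_email", "Remind Acme about invoice 42"))
```

Because agent code only ever calls `complete(step, prompt)`, moving a step to a different provider means editing one entry in `ROUTES` rather than touching the workflow logic.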

Week-by-week — the 90-day plan

Weeks 1-2: Discovery and architectural anchoring

Goal: Answer the four architectural decisions above. Choose the first workflow. Confirm the source systems and the data available.

Deliverables:

Signed one-page discovery brief: first workflow, business KPI, target outcome, source systems, model routing policy, approval gate design, memory schema
Access to read-only credentials for the ERP, CRM, and email systems in scope
Named team: business owner (the person whose weekly job improves), technical lead, one senior engineer, one prompt/evaluation engineer, one platform engineer, one PM
Baseline metric captured: what does the workflow's outcome look like today, before the layer? You need this number to prove the layer worked

Failure mode: Skipping the baseline. If you do not know what "before" looked like, you will not be able to defend the "after" numbers. Two-day investment, career-long value.
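For a collections workflow, the baseline could be as simple as the sketch below. The invoice records are made-up sample data, and the metric names (`median_days_late`, `share_late`) are illustrative choices rather than a prescribed KPI set.

```python
from datetime import date
from statistics import median

# Illustrative records only; real ones would come from the ERP read-only connector.
invoices = [
    {"due": date(2024, 1, 10), "paid": date(2024, 1, 12)},
    {"due": date(2024, 1, 15), "paid": date(2024, 1, 30)},
    {"due": date(2024, 2, 1), "paid": date(2024, 2, 1)},
    {"due": date(2024, 2, 10), "paid": date(2024, 3, 1)},
]

def days_late(invoice):
    """Days between due date and payment date, never negative."""
    return max((invoice["paid"] - invoice["due"]).days, 0)

def baseline(invs):
    """Summarize payment lateness before the layer goes live."""
    late = [days_late(i) for i in invs]
    return {
        "invoices": len(late),
        "median_days_late": median(late),
        "share_late": sum(1 for d in late if d > 0) / len(late),
    }

print(baseline(invoices))
```

Storing this dictionary with a capture date gives the 90-day outcome report a fixed "before" to compare against.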

Weeks 3-4: Foundation build

Goal: Connect the layer to the source systems and unify the data into a working semantic layer. No AI yet.

Deliverables:

Read-only connectors live to each source system (ERP + CRM + email as the minimum starting set)
Unified customer entity: the layer can pull "everything we know about Customer X" from all three sources in under a second
Institutional memory table live in Postgres, receiving test events
Basic dashboard showing the unified view for a sample of 10 customers so the business owner can validate
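A minimal sketch of the unified customer entity, assuming every source can be mapped to one shared key (a tax ID here). The field names and sample rows are invented for illustration and do not reflect any real ERP, CRM, or mailbox schema.

```python
from collections import defaultdict

def normalize_tax_id(raw: str) -> str:
    """Keep only alphanumerics, upper-cased, so 'TR-123 456' and 'tr123456' match."""
    return "".join(ch for ch in raw if ch.isalnum()).upper()

# Illustrative rows; field names are assumptions.
erp_rows = [{"tax_id": "TR-123 456", "name": "Acme Ltd", "open_balance": 1200.0}]
crm_rows = [{"tax_id": "tr123456", "owner": "a.owner", "stage": "customer"}]
email_rows = [{"tax_id": "TR123456", "last_contact": "2024-03-01"}]

def unify(erp, crm, email):
    """Group rows from all three sources under one normalized customer key."""
    customers = defaultdict(lambda: {"erp": None, "crm": None, "email": []})
    for row in erp:
        customers[normalize_tax_id(row["tax_id"])]["erp"] = row
    for row in crm:
        customers[normalize_tax_id(row["tax_id"])]["crm"] = row
    for row in email:
        customers[normalize_tax_id(row["tax_id"])]["email"].append(row)
    return dict(customers)

unified = unify(erp_rows, crm_rows, email_rows)
print(unified["TR123456"]["crm"]["owner"])
```

In practice the hard part is the key itself: where no shared identifier exists, a maintained mapping table between source IDs usually has to come first.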

Why this comes before the AI: If the unified view is wrong, the AI reasoning on top will be wrong faster. Fix the data foundation first.

Failure mode: Rushing the connector layer. If ERP data is stale by 24 hours or CRM has three different customer identifiers that do not resolve, the layer's decisions will inherit those gaps. This is the largest predictor of program failure at week 12.
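A staleness check like the one below is one cheap guard against this failure mode. The 24-hour threshold is borrowed from the example above; the `stale_sources` function and the sync timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # threshold borrowed from the 24-hour example in the text

def stale_sources(last_synced, now, max_age=MAX_AGE):
    """Return the names of sources whose last successful sync is older than max_age."""
    return sorted(name for name, ts in last_synced.items() if now - ts > max_age)

# Fixed timestamps so the example is reproducible.
now = datetime(2024, 5, 6, 9, 0, tzinfo=timezone.utc)
last_synced = {
    "erp": now - timedelta(hours=30),
    "crm": now - timedelta(hours=2),
    "email": now - timedelta(minutes=10),
}
print(stale_sources(last_synced, now))  # ['erp']
```

Running a check of this kind before each decision batch lets the layer hold back flags that would rest on outdated ERP data.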

Weeks 5-8: First workflow build

Goal: Ship the first operational workflow end to end — decision production, human-approved action, outcome logging.

Week 5-6 deliverables:

Decision logic implemented for the target workflow (e.g. "identify customers with degraded payment behavior relative to their own 12-month baseline")
LLM-drafted action packaged for human review (e.g. draft follow-up email, task attached to the account owner in CRM with relevant invoice context)
Evaluation harness running against a labeled test set: 50-100 real historical cases the business owner has classified as "should escalate" vs "should not"
Success rate against the harness (the share of test cases where the layer's decision matches the business owner's label) must exceed 80% before the workflow goes live
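The example rule in the list above ("degraded payment behavior relative to their own 12-month baseline") could be approximated as follows. The z-score approach, the threshold of 2.0, and the minimum-history guard are assumptions chosen for this sketch; a real workflow would tune them against the labeled test set.

```python
from statistics import mean, pstdev

def is_degraded(history_days_to_pay, recent_days_to_pay,
                z_threshold=2.0, min_history=6):
    """Flag a customer whose recent days-to-pay sits well above their own baseline."""
    if len(history_days_to_pay) < min_history or not recent_days_to_pay:
        return False  # not enough evidence, so stay silent rather than guess
    base = mean(history_days_to_pay)
    # Fall back to 1.0 so perfectly regular payers do not cause a division by zero.
    spread = pstdev(history_days_to_pay) or 1.0
    z = (mean(recent_days_to_pay) - base) / spread
    return z >= z_threshold

# Illustrative 12-month history of days-to-pay for one customer.
history = [30, 32, 29, 31, 30, 28, 33, 30, 31, 29, 30, 32]
print(is_degraded(history, [45, 50]))  # True
print(is_degraded(history, [31, 30]))  # False
```

Comparing each customer to their own history, rather than to a global cutoff, keeps habitually slow but stable payers from flooding the review queue.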

Week 7-8 deliverables:

Workflow live in the source system for the responsible team
Daily dashboard: how many decisions did the layer flag, how many did the human approve, what was the outcome
First week of production data logged
Retrospective at end of week 8: what worked, what needs tuning, what breaks
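The three dashboard questions (flagged, approved, outcome) reduce to counting logged events per day. The event shape and the `daily_summary` helper below are illustrative assumptions, not a fixed reporting format.

```python
from collections import Counter

# Illustrative events, in the shape a memory table might return them.
events = [
    {"day": "2024-05-06", "kind": "decision"},
    {"day": "2024-05-06", "kind": "decision"},
    {"day": "2024-05-06", "kind": "decision"},
    {"day": "2024-05-06", "kind": "approval"},
    {"day": "2024-05-06", "kind": "approval"},
    {"day": "2024-05-06", "kind": "outcome"},
    {"day": "2024-05-07", "kind": "decision"},
]

def daily_summary(events, day):
    """Count flagged decisions, human approvals and logged outcomes for one day."""
    counts = Counter(e["kind"] for e in events if e["day"] == day)
    flagged = counts["decision"]
    approved = counts["approval"]
    return {
        "flagged": flagged,
        "approved": approved,
        "approval_rate": approved / flagged if flagged else None,
        "outcomes_logged": counts["outcome"],
    }

print(daily_summary(events, "2024-05-06"))
```

A falling approval rate is an early signal that the decision logic or the drafts need tuning before the week 8 retrospective.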

Failure mode: Going live without the evaluation harness. Every LLM-driven workflow needs a labeled test set that catches regressions whenever a prompt changes. Without it, the first prompt change can silently degrade quality, and nobody notices until customers complain.
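At its core, an evaluation harness of this kind is a scoring loop over labeled cases. In the sketch below the `run_harness` function, the case format, and the toy `rule` classifier are all hypothetical; the "must exceed 80%" gate comes from the week 5-6 deliverables.

```python
def run_harness(classify, labeled_cases, threshold=0.80):
    """Score a classifier against business-owner labels and apply the go-live gate."""
    failures = [c for c in labeled_cases if classify(c["features"]) != c["label"]]
    accuracy = (len(labeled_cases) - len(failures)) / len(labeled_cases)
    return {"accuracy": accuracy, "passed": accuracy > threshold, "failures": failures}

def case(days_late, label):
    return {"features": {"days_late": days_late}, "label": label}

# Tiny illustrative test set; a real one holds 50-100 historical cases.
cases = [
    case(30, "escalate"), case(20, "escalate"), case(16, "escalate"),
    case(40, "escalate"), case(15, "escalate"), case(12, "escalate"),
    case(2, "no_escalate"), case(0, "no_escalate"),
    case(5, "no_escalate"), case(10, "no_escalate"),
]

def rule(features):
    """Toy stand-in for the LLM-backed decision step."""
    return "escalate" if features["days_late"] > 14 else "no_escalate"

report = run_harness(rule, cases)
print(report["accuracy"], report["passed"])
```

Rerunning the same harness on every prompt or rule change, and appending each newly seen production case to `cases`, is what turns it into a regression check rather than a one-off test.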

Weeks 9-12: Second workflow + hardening

Goal: Prove the pattern generalizes. Ship a second workflow using the foundation from weeks 3-4. Harden the first workflow with production feedback.

Week 9-10 deliverables:

Second workflow scoped (typically a different operational KPI — if the first was collections, the second might be customer-loss prevention or stock alerts)
Second workflow uses the same data unification layer, adds domain-specific decision logic on top
First workflow gets its production tuning: prompts refined based on human approval-vs-rejection patterns, evaluation harness extended with newly seen cases

Week 11-12 deliverables:

Second workflow live
Consolidated dashboard: both workflows, cross-workflow metrics (total hours saved, total decisions flagged, total actions executed)
Operations runbook completed: how the responsible team operates the layer day to day, what to do when it breaks, who to escalate to
90-day outcome report: baseline vs current for both workflows, dollar impact, next-quarter roadmap

Failure mode: Treating workflow two as a copy of workflow one. Each workflow has its own decision logic, evaluation set, and human-in-the-loop design. Copy-pasting the first workflow saves engineering time but produces a second workflow that does not actually improve the second KPI.

Week-by-week deliverables — one-page summary

Weeks | Phase | Key deliverable
1-2 | Discovery | Discovery brief + baseline metric + named team
3-4 | Foundation | Read-only connectors + unified customer entity + memory schema
5-6 | Build workflow 1 | Decision logic + evaluation harness passing 80%+
7-8 | Ship workflow 1 | Workflow live + daily dashboard + retrospective
9-10 | Build workflow 2 | Second workflow scoped + first-workflow tuning
11-12 | Ship workflow 2 | Second workflow live + runbook + 90-day outcome report

If any week slips, the next week's deliverable slips. Do not try to catch up by compressing later weeks; the compression always kills the evaluation harness first, and the evaluation harness is what keeps the layer trustworthy.

The six roles you actually need on the team

Fewer than this and something drops. More than this and coordination overhead outweighs the added capacity.

Business owner — the person whose weekly job improves. Not the CIO. The actual head of collections, or head of ops, or CFO if the workflow is finance. Signs the discovery brief, defines the KPI, approves the go-live.
Technical lead — architect-level engineer. Owns the four architectural decisions. Present in every week.
Senior engineer — writes the connector layer and orchestration. Full-time weeks 1-12.
Prompt / evaluation engineer — builds the labeled evaluation set, tunes prompts, runs the harness. Part-time weeks 1-4, full-time weeks 5-12.
Platform engineer — infrastructure, observability, cost tracking, deployment. Part-time throughout.
PM / program manager — keeps the discovery deliverables on track, runs the weekly stand-up, owns the risk log. Part-time throughout.

If you cannot staff all six from your organization, this is where a specialist partner earns its fee — we typically bring three or four of these roles (technical lead, senior engineer, prompt engineer, platform) and pair with your business owner and PM.

The eight checkpoints that separate ship from stall

Programs that pass all eight ship. Programs that fail more than two of these usually do not.

1. Baseline captured before build starts. Without it, "we improved" is a claim, not a measurement.
2. Business owner attends every weekly review. If they cannot, the layer is not really their priority and the go-live will get delayed.
3. Read-only credentials provisioned by end of week 2. If IT has not moved by week 2, they will not move by week 4.
4. Evaluation harness at 80%+ before go-live. Live traffic will surface new cases; you need a strong starting baseline.
5. Human approval gate on every action for the first 90 days. Do not let anyone talk you into "just this one action can go automatically" until the harness is trusted.
6. Weekly outcome dashboard reviewed by the business owner. Not the PM. Not the technical lead. The person whose KPI moves.
7. First customer complaint or false positive resolved within 48 hours. How you handle the first real production error signals whether the layer is trustworthy.
8. 90-day retrospective documents what to change in the next quarter. Without a documented next-quarter plan, the program slips into maintenance mode and stops improving.
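The human approval gate checkpoint is easier to hold when the write-back path itself refuses unapproved actions. The sketch below shows one way to do that; `DraftAction`, `NotApprovedError`, and `execute` are hypothetical names, and the `send` callable stands in for whatever connector performs the real write-back.

```python
from dataclasses import dataclass
from typing import Optional

class NotApprovedError(RuntimeError):
    """Raised when something tries to execute an action no human has approved."""

@dataclass
class DraftAction:
    kind: str      # for example 'email' or 'crm_task'
    target: str    # unified customer id
    body: str      # the AI-drafted content
    approved_by: Optional[str] = None

    def approve(self, user: str) -> None:
        """Record which human took responsibility for the action."""
        self.approved_by = user

def execute(action: DraftAction, send) -> None:
    """Write back only when a named human has approved; otherwise refuse."""
    if action.approved_by is None:
        raise NotApprovedError(f"{action.kind} for {action.target} has no human approval")
    send(action)

sent = []  # stand-in for the source-system connector
draft = DraftAction("email", "cust-001", "Friendly reminder about invoice 42")
try:
    execute(draft, sent.append)
except NotApprovedError as err:
    print("blocked:", err)

draft.approve("a.owner")
execute(draft, sent.append)
print(len(sent))  # 1
```

Putting the check inside `execute` rather than in the user interface means a new workflow cannot bypass the gate by accident.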

What comes after day 90

The 90-day roadmap is a starting point, not a finish line. The organizations that get the most value from an AI Operations Layer treat quarter one as the proof-of-concept quarter, quarter two as the workflow-expansion quarter (typically five to seven more workflows added), and quarter three as the trust-graduation quarter — the point at which some low-risk workflows earn full autonomy and the human approval gate is relaxed for those specific actions.

Autonomy graduation is workflow-by-workflow. A late-payment reminder email that has been correctly drafted 98% of the time over 500 samples can graduate to full send. A credit hold decision, which has expensive consequences if wrong, does not graduate until the pattern is very well understood, and some decisions never graduate. That is fine. The layer's value is not full autonomy; it is compressed decision cycle time with correctness maintained.
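The graduation rule can be written down explicitly so that it is applied the same way to every workflow. In this sketch the 98% rate and 500-sample minimum come from the reminder-email example above, while the `can_graduate` function and the `high_risk` flag are assumptions made for illustration.

```python
def can_graduate(approved_unchanged, total_reviewed, high_risk,
                 min_samples=500, min_rate=0.98):
    """Decide whether one action type may skip the human approval gate."""
    if high_risk:
        return False  # for example credit holds: keep the human in the loop
    if total_reviewed < min_samples:
        return False  # not enough reviewed drafts to trust the rate yet
    return approved_unchanged / total_reviewed >= min_rate

print(can_graduate(495, 500, high_risk=False))  # True
print(can_graduate(495, 500, high_risk=True))   # False
print(can_graduate(99, 100, high_risk=False))   # False
```

Both inputs can be read from the institutional memory log, since it already records which drafts humans approved and which they rejected or edited.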

Where Internative fits

Internative ships the AI Operations Layer implementation pattern described above through our Koordex product line. The 90-day roadmap is our standard first-engagement structure. We typically bring the technical lead, senior engineer, prompt/evaluation engineer, and platform engineer; you bring the business owner and the PM. Our pricing sits at the pilot end of the Aissist alternatives comparison, and the outcome pattern we ship to matches the Koordex mid-market distributor case study.

For further reading on the architecture pattern, see What Is the AI Operations Layer and AI Operations Layer vs MLOps vs LLMOps for how the layer relates to adjacent AI infrastructure categories. For a broader view of how to evaluate any AI implementation partner, see How to Evaluate an AI Agent Development Vendor.

Frequently asked questions

Can I really implement an AI Operations Layer in 90 days?

Yes, on top of an established ERP + CRM stack with reasonable data quality. The roadmap above takes the first workflow live in weeks 7-8 and the second in weeks 11-12; later workflows are typically added in 60-day increments as the foundation stabilizes. Companies that miss the 90-day mark usually miss because IT credential provisioning slipped in weeks 1-2 or because the evaluation harness was skipped in weeks 5-6.

What if my ERP is highly customized or old?

The layer can still overlay a customized ERP as long as read-only API access or database access exists. Very old ERP installations sometimes require an integration adapter, which adds 2-3 weeks to the foundation phase (weeks 3-4). If the ERP has no API and no database access, the layer cannot reach it; this is the one hard blocker. That situation is rare for Logo, SAP, Dynamics, or NetSuite and more common with heavily customized legacy systems.

Do I need to build the evaluation harness or can we skip it?

Do not skip it. The evaluation harness is what tells you whether a prompt tweak in week 20 accidentally degraded workflow accuracy by 15%. Without it, degradation is invisible until customers complain. Building the harness in weeks 5-6 takes about two developer-weeks of effort. Skipping it saves those two weeks and creates a system that erodes silently. The trade-off is not worth it.

How much does a 90-day AI Operations Layer implementation cost?

Cost varies with scope, team composition, and existing-system complexity. The healthy first-engagement structure sits in the pilot-plus-first-production-workflow range. Buyers evaluating this should ask for the scope and outcome commitment, not a headline number — the same 90 days at Big-4 rates and at specialist rates can differ by 3-5x for the same operational outcome. See our Big-4 vs Specialist AI Consulting guide for the vendor-tier tradeoff.

What is the smallest useful first workflow?

Collections follow-up is often the highest-ROI first workflow because it has a clear dollar outcome (recovered receivables), a defined trigger (payment behavior deviation from baseline), and a human approval gate that account owners are already comfortable with (they already draft follow-up emails). Stock-depletion alerts are a close second for distribution and retail. Customer-loss prevention (proactive outreach when ordering pattern breaks) is a strong third when the customer base is large enough to make retention improvement material.

Should I hire in-house or work with a vendor for the first 90 days?

Work with a vendor for the first 90 days unless you have a senior AI engineering team already in place. The pattern is subtle in places (evaluation harness design, model routing, approval-gate UX), and the cost of learning it in-house is 6-9 months of ramp before you ship the first workflow. A specialist vendor ships the first workflow in 90 days and transfers ownership to your team by month 6. Most mid-market organizations should follow that path.

What happens if the first workflow underperforms?

You use the 90-day retrospective to name why. The three most common causes are: (1) the wrong first workflow — the operational pain was there but the KPI was not measurable enough to defend the impact; (2) data foundation gaps — the source-system data had integrity issues nobody flagged in discovery; (3) approval-gate friction — the responsible team did not adopt the daily dashboard, so decisions the layer flagged never turned into actions. Each has a defined remediation, and none is fatal to the program.

How does this differ from workflow automation tools like Zapier or n8n?

Workflow automation tools trigger a defined action when a defined event occurs. An AI Operations Layer decides which event matters based on AI reasoning over unified data, packages a context-rich action for human review, and remembers what happened for continuous improvement. For very simple use cases, workflow automation is sufficient. For complex operational decisioning across many systems, it is not. The 90-day roadmap above does not translate meaningfully to a workflow automation tool; that tool sits at a different layer of the stack.