What 5 production AI builds taught me about cost, latency, and time-to-ship
Anonymized data and concrete lessons from shipping AI features into production over 18 months, working solo.
Over 18 months we shipped 5 production AI platforms as a one-lead shop. This post is the anonymized data: what we measured, what cost what, where AI saved real hours, and where it cost more than it returned. Headline numbers: build times of 2–8 weeks, model spend of $200–$1,800/month per platform at production volume, ticket-triage automation handling 60–80% of tier-1, and end-to-end p95 latency of 1.2–2.4 seconds for streamed Claude responses. No vendor names, no client names — everything we ship is under NDA.
The 5 builds, in one paragraph each
MSP operations platform. Multi-tenant ticketing for an IT managed-services provider running 1,000+ daily operations. Built in 6 weeks. Claude classifies inbound tickets, drafts dispatch SMS to field engineers, and routes the predictable categories straight to the queue. Tier-1 triage is now 80% automated; the human-in-loop queue holds the rest. Stack: Next.js 15, Postgres, Drizzle, Inngest for the triage pipeline, Claude Sonnet 4.6 with prompt caching.
Telemedicine platform. Doctor portal, video consultations, e-prescriptions, Stripe billing, S3 for clinical documents. 8 weeks. HIPAA-aware architecture — and crucially, no AI in PHI paths. Claude only touches structured summary metadata for doctor review, never raw transcripts or patient identifiers. The audit trail is more valuable than the automation here.
Legal IP management. Multi-attorney workflow, client portfolios, deadline tracking. Built without AI in v1 on purpose. Workflow software first; AI is on the v2 roadmap for prior-art summarization once the base data model is mature enough to make retrieval useful. This was the most disciplined call we made — and the easiest to defend in scoping.
UK digital pharmacy. Regulated pharmacy storefront, Stripe Checkout, prescription validation, dispatch flow. 10 weeks. Claude drafts product descriptions from structured regulatory metadata for review, never publishing directly. Every AI artifact carries a draft-by/reviewed-by audit row. The regulator's expectation is provenance, not autonomy.
Multi-LLM enterprise AI platform. Provider-agnostic chat-and-tool surface for an enterprise internal team. Routes between Claude, GPT, and Gemini behind a single API. DLP filter on inbound, structured audit logging on every turn, per-tenant cost telemetry, role-scoped tool access. This is the build where everything we learned about cost discipline got crystallized into code.
Cost
Production model spend across the 5 builds ran between $200/month (smallest, low-volume internal use) and $1,800/month (highest, multi-tenant with hundreds of triages per day). That spread is real and worth internalizing: cost per build is not a function of how clever the AI is — it's a function of volume and prompt structure.
The single highest-ROI lever we found is prompt caching. On the two builds where the system prompt is static — same routing rules, same tool definitions, same examples on every call — caching cut Claude input cost 60–72%. The work to enable it was a single SDK flag and a one-time prompt restructure to put the stable content first. There is no excuse for not running this on any workload that re-uses a system prompt more than a few times per minute.
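As a concrete sketch — assuming the Anthropic TypeScript SDK and a static triage-style system prompt; the model id and prompt content are placeholders, not production values — the restructure plus the flag looks roughly like this:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stable content lives in one block that never changes between calls:
// routing rules, tool definitions, few-shot examples.
const STATIC_SYSTEM_PROMPT = `You are a tier-1 ticket triage assistant.
Categories: hardware, network, access, billing, other.
...routing rules and examples go here, identical on every call...`;

export async function triageTicket(ticketBody: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // illustrative model id; pin whatever you run in prod
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: STATIC_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // the "one SDK flag"
      },
    ],
    // Only the per-ticket content varies; it sits after the cached prefix.
    messages: [{ role: "user", content: ticketBody }],
  });
}
```

The restructure matters as much as the flag: anything above the cache marker must be byte-identical between calls, or the cache never hits.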
Where the spend ran high, the cost driver was almost always one of three things: (1) re-sending large context on every turn instead of caching, (2) reaching for a larger model where Sonnet was already sufficient, or (3) retrying on flaky tool calls without an exponential backoff. We instrumented per-tenant cost telemetry on the multi-tenant builds and watched the bill drop within the first week of having the dashboard visible. Visibility is a feature.
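For the retry point, a minimal backoff wrapper is all it takes. This is a generic sketch, not code lifted from the builds:

```typescript
// Back off exponentially instead of hammering a flaky tool or model endpoint,
// which is what silently multiplies the bill.
export async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // 500ms, 1s, 2s, 4s ... plus jitter so concurrent retries don't align.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage: wrap the flaky call once, rather than retrying inside the prompt loop.
// const result = await withBackoff(() => callExternalTool(args));
```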
Latency
p95 for streamed Claude responses, measured end-to-end including UI render, ran 1.2–2.4 seconds. Cold first-token landed at 600–1,100ms. Streaming masks the rest — users perceive the response as arriving by the time the first sentence is on screen, even if the full completion takes another second.
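A rough sketch of how that gets measured, using the Anthropic SDK's streaming helper; the model id is a placeholder and the timing fields are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stream the response and log time-to-first-token, which is the number users
// actually feel; the full completion can trail behind it.
export async function streamWithTiming(prompt: string) {
  const start = performance.now();
  let firstTokenMs: number | null = null;

  const stream = client.messages.stream({
    model: "claude-sonnet-4-5", // illustrative model id
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  stream.on("text", (delta) => {
    if (firstTokenMs === null) {
      firstTokenMs = performance.now() - start;
    }
    process.stdout.write(delta); // in the app this feeds the UI stream instead
  });

  const final = await stream.finalMessage();
  const totalMs = performance.now() - start;
  console.log({ firstTokenMs, totalMs, outputTokens: final.usage.output_tokens });
  return final;
}
```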
Where we had to call synchronously — because the next step needed the full response, or because the surface was a webhook handler with a strict timeout — full-response calls cost 3–6 seconds. On every build where this came up, we redesigned the UI around streaming to avoid that bucket. If the user is staring at a spinner for four seconds, you have a UX problem the model can't fix.
Two further notes. First, function calling for structured output is slightly slower than free text — the model has more constraints to satisfy — but it eliminates a whole class of parse-and-retry cycles that were costing us more time on the other end. Second, the latency difference between hosting providers (Claude via Anthropic's API vs. AWS Bedrock vs. Vertex AI) is rarely the bottleneck. The bottleneck is your serialization, your auth check, and whatever middleware sits between the request and the SDK.
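A hedged sketch of that function-calling pattern: force a tool call so the response arrives as a structured object rather than free text to parse. The tool name and schema are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function classifyTicket(ticketBody: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // illustrative model id
    max_tokens: 512,
    tools: [
      {
        name: "record_triage",
        description: "Record the triage decision for an inbound ticket.",
        input_schema: {
          type: "object",
          properties: {
            category: {
              type: "string",
              enum: ["hardware", "network", "access", "billing", "other"],
            },
            priority: { type: "string", enum: ["p1", "p2", "p3"] },
            confidence: { type: "number" },
          },
          required: ["category", "priority", "confidence"],
        },
      },
    ],
    tool_choice: { type: "tool", name: "record_triage" }, // always answer via the tool
    messages: [{ role: "user", content: ticketBody }],
  });

  // The SDK hands back a tool_use block with a structured input object —
  // no JSON-out-of-prose parsing, no retry loop on malformed output.
  const block = response.content.find((b) => b.type === "tool_use");
  return block && block.type === "tool_use" ? block.input : null;
}
```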
Time-to-ship
Build-tier engagements — a single workflow with auth and payments — shipped in 2–4 weeks. Scale-tier engagements with multi-tenant, multi-role, real reporting — 4–8 weeks. These numbers don't move much across the 5 builds because the stack doesn't move much: Next.js 15, Postgres, Drizzle, NextAuth or Clerk, Stripe, Vercel. Familiar tools compound.
The longest lead time wasn't engineering. It was waiting on the client's compliance review of AI-generated output for the regulated builds. For the UK pharmacy, every AI-drafted product description had to be approved by their superintendent pharmacist before going live. Plan for 3–5 days of review per AI surface in regulated verticals, and surface a review queue in the admin UI from day one. Don't ship the AI and bolt on the review later — you'll be re-doing the data model.
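A minimal Drizzle sketch of the review-queue shape we now start with — table and column names are illustrative, but the drafted-by/reviewed-by/status trio is the part you don't want to retrofit:

```typescript
import { pgTable, text, timestamp, uuid } from "drizzle-orm/pg-core";

// Every AI artifact records who (or what) drafted it, who reviewed it, and its
// current state, so the admin review queue is just a query on status = 'draft'.
export const aiDrafts = pgTable("ai_drafts", {
  id: uuid("id").primaryKey().defaultRandom(),
  tenantId: uuid("tenant_id").notNull(),
  artifactType: text("artifact_type").notNull(), // e.g. "product_description"
  content: text("content").notNull(),
  draftedBy: text("drafted_by").notNull(),       // model id or user id
  reviewedBy: text("reviewed_by"),               // null until a human signs off
  status: text("status", { enum: ["draft", "approved", "rejected"] })
    .notNull()
    .default("draft"),
  createdAt: timestamp("created_at").notNull().defaultNow(),
  reviewedAt: timestamp("reviewed_at"),
});
```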
What AI replaced — honestly
Numbers from the builds where we have actual measurement:
- Tier-1 triage: 60–80% replaced. Classification + routing of the predictable categories. Escalation queue for the rest.
- Drafting (visit summaries, dispatch SMS, product copy): humans now edit, they don't draft from a blank page. 3–5× faster per artifact. The model sets the floor; the human raises it.
- Customer-facing reply drafts in support: 50–70% sent without edit on the MSP build, where the question shapes are well-known.
- Decision-making (routing tickets to the right team, choosing which template to draft from): replaced for the common cases, escalation queue for the rest — see the sketch after this list.
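The escalation split, sketched under an assumed confidence field from the classifier; the threshold and queue helpers are illustrative placeholders:

```typescript
type Triage = { category: string; priority: string; confidence: number };

const AUTO_ROUTE_THRESHOLD = 0.85; // tuned per build, not a universal constant

// Act automatically only above the threshold; everything below it lands in
// the human-in-loop queue.
export async function routeTicket(ticketId: string, triage: Triage) {
  if (triage.confidence >= AUTO_ROUTE_THRESHOLD) {
    await assignToQueue(ticketId, triage); // straight to the team's queue
  } else {
    await escalateToHuman(ticketId, triage); // a person makes the call
  }
}

// Placeholders for whatever the build's queueing layer is (Inngest steps in ours).
async function assignToQueue(ticketId: string, triage: Triage) {
  /* enqueue dispatch */
}
async function escalateToHuman(ticketId: string, triage: Triage) {
  /* push to review queue */
}
```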
What AI didn't replace, in any of the 5 builds: irreversible actions, legal output, multi-step reasoning over fresh data without retrieval. We don't put AI on paths a user can't undo. We don't publish AI output as final without a named human reviewer. We don't expect the model to reason over data it has never seen — that's what RAG is for, and the retrieval has to actually work, not just be present.
Where AI cost more than it returned
Three honest losses:
Regulated-copy drafts took longer than human writing on one build. The pharmacy product descriptions: time-to-publish per artifact was higher with AI in the loop than without, because every draft pulled a pharmacist into a review they wouldn't have needed for human-written copy they already trusted. The fix wasn't to remove the AI — net human throughput is higher with the drafting floor in place — but to be honest that the per-artifact cycle slowed down before the volume benefit showed up.
Vector search lost to a Postgres LIKE query on the smallest build. We built a pgvector retrieval layer on a corpus of fewer than 2,000 documents. The embedding cost was real, the index build was real, and the retrieval quality was worse than a hand-tuned full-text query on the same data. We ripped it out. Below 10,000 documents, Postgres full-text search almost always wins. Vector is for when you have either scale or semantic queries the user isn't typing the keywords for.
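For reference, the kind of query that replaced the vector layer — assuming a stored tsvector column on the documents table; table and column names are illustrative:

```typescript
import { sql } from "drizzle-orm";
import type { NodePgDatabase } from "drizzle-orm/node-postgres";

// Plain Postgres full-text search: rank by ts_rank against a precomputed
// search_vector column (e.g. generated from title + body).
export async function searchDocuments(db: NodePgDatabase, query: string) {
  return db.execute(sql`
    SELECT id, title,
           ts_rank(search_vector, websearch_to_tsquery('english', ${query})) AS rank
    FROM documents
    WHERE search_vector @@ websearch_to_tsquery('english', ${query})
    ORDER BY rank DESC
    LIMIT 20
  `);
}
```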
Fine-tuning, attempted once, abandoned. On the MSP build we tried fine-tuning a smaller model on historic ticket data to drop inference cost. Three weeks later: prompt engineering plus retrieval beat the fine-tuned model on quality, ran at 5% of the total cost once you include the training and re-training pipeline, and was an order of magnitude faster to iterate on. We don't reach for fine-tuning anymore unless the client has explicitly asked and we've exhausted prompt + RAG + tool use first.
Stack notes worth sharing
- Prompt caching is the highest-ROI Claude technique we use. Static system prompts, stable tool definitions, few-shot examples — put them first, mark them cacheable, watch the bill drop 60–70%. One SDK flag.
- Function calling for structured output. We don't parse JSON out of free text anymore. The model returns a tool call, the SDK gives us a typed object, we move on.
- Streaming everywhere we can. Even when the consumer is another system, the first-token-to-first-byte latency matters for the eventual user-facing surface.
- RAG against your data, not the internet. Every meaningful retrieval improvement we made on the enterprise build came from cleaning the indexed corpus, not changing the embedding model.
- Postgres handles 90% of what teams reach for vector DBs for, under 10K docs. pgvector for embeddings, tsvector for full-text, JSONB for the messy middle. One database, three jobs, no extra vendor.
- RLS on every multi-tenant build, day one. Tenant isolation enforced at the database layer means a bug in API code can't leak data across tenants. This is non-negotiable in any of the regulated builds (a sketch follows this list).
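The RLS setup, sketched as a Drizzle migration step; table and setting names are illustrative, and it assumes the app sets a per-request tenant id on the connection:

```typescript
import { sql } from "drizzle-orm";
import type { NodePgDatabase } from "drizzle-orm/node-postgres";

// Enable row-level security and scope every read and write to the tenant id
// carried by the connection. Note: RLS does not bind the table owner unless
// forced, so the app should connect as a non-owner role.
export async function enableTenantIsolation(db: NodePgDatabase) {
  await db.execute(sql`ALTER TABLE tickets ENABLE ROW LEVEL SECURITY`);
  await db.execute(sql`
    CREATE POLICY tenant_isolation ON tickets
      USING (tenant_id = current_setting('app.tenant_id')::uuid)
      WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid)
  `);
}

// Per request, inside the transaction, before any query:
// await tx.execute(sql`SELECT set_config('app.tenant_id', ${tenantId}, true)`);
```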
What we'd do differently
- Ship the eval harness before the feature. On builds where we wrote the eval first — held-out test set, automated regression on every prompt change, latency and cost tracked per swap — every later iteration was cheaper. On builds where we retrofitted evals, we paid that cost twice. Custom evals are not optional.
- Cost telemetry on day one, not week six. Per-tenant, per-feature spend logged into Postgres, surfaced in an admin view (a sketch follows this list). Without it, the bill is a surprise; with it, the bill is a dashboard.
- Provider-agnostic from the start, even on single-model builds. A thin abstraction at the SDK layer costs almost nothing on day one and saves entire weeks the first time a price or capability shift makes the other vendor the right answer.
- Audit-log the AI like you audit-log payments. Every model call, every tool invocation, every redaction. On the multi-LLM enterprise build this turned out to be the feature the buyer actually needed — not the model routing, the audit trail.
- Don't put AI on the irreversible path. Send the email, charge the card, file the prescription — these stay human-approved. The model drafts, the human decides. We've never regretted that line. We've regretted the times we got close to crossing it.
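The cost-telemetry table, as a hedged Drizzle sketch; column and feature names are illustrative:

```typescript
import { integer, numeric, pgTable, text, timestamp, uuid } from "drizzle-orm/pg-core";

// One row per model call, priced at write time, so the admin dashboard is a
// GROUP BY on tenant and feature rather than a spreadsheet exercise.
export const modelCalls = pgTable("model_calls", {
  id: uuid("id").primaryKey().defaultRandom(),
  tenantId: uuid("tenant_id").notNull(),
  feature: text("feature").notNull(),          // e.g. "ticket_triage", "draft_reply"
  model: text("model").notNull(),
  inputTokens: integer("input_tokens").notNull(),
  outputTokens: integer("output_tokens").notNull(),
  cachedInputTokens: integer("cached_input_tokens").notNull().default(0),
  costUsd: numeric("cost_usd", { precision: 10, scale: 6 }).notNull(),
  createdAt: timestamp("created_at").notNull().defaultNow(),
});

// After each call: insert the SDK's reported usage (input/output token counts)
// multiplied by your own price table for that model.
```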
Wolrix is led by Uros — one technical lead per project, with proven specialists when the scope demands. If your project fits the shape above — $10K–$50K, 2–8 weeks, NDA-first — book a 15-min intro on Calendly or get a free architecture audit at /free-audit.
All numbers in this post are anonymized to ranges from real shipped work. Specific clients, specific platforms, and specific contractual figures remain under NDA — as every Wolrix engagement is by default.