The real cost of building an in-house LLM stack

Per-token pricing is the tip of the iceberg. By the time you're three months into a serious LLM deployment, the per-token bill is rarely the biggest line item. Here's the honest breakdown.

The five costs nobody warns you about

1. Inference (yes, the obvious one)

Frontier-model pricing from OpenAI and Anthropic has come down dramatically. For most use cases you're looking at $5–$30/million input tokens and $15–$120/million output tokens depending on the tier. Open-weights models you self-host can be 3–10× cheaper at scale, but you have to factor in GPU rental.
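As a rough sanity check on the inference line, the monthly bill is just request volume times per-request token cost. The sketch below assumes a hypothetical workload (300 input and 100 output tokens per request) and the low end of the price range quoted above. The function name and the token counts are illustrative, not measurements from a real deployment, so swap in your own.

```python
def monthly_inference_cost(requests_per_day, input_tokens, output_tokens,
                           input_price_per_m, output_price_per_m, days=30):
    """Rough monthly API spend in dollars under per-token pricing."""
    # Cost of one request, scaled by one million so the arithmetic stays
    # exact until the final division.
    per_request_scaled = (input_tokens * input_price_per_m
                          + output_tokens * output_price_per_m)
    return requests_per_day * days * per_request_scaled / 1_000_000


# Hypothetical workload: 100k requests/day, 300 input and 100 output tokens
# per request, priced at $5 and $15 per million tokens.
estimate = monthly_inference_cost(100_000, 300, 100, 5, 15)
print(f"${estimate:,.0f} per month")  # prints "$9,000 per month"
```

Output tokens usually cost several times more than input tokens, so trimming verbose completions tends to move this number more than trimming prompts of the same length.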

2. Embeddings + vector storage

Every RAG system needs embeddings, and you will re-embed every time your chunking strategy changes (which is more often than you would guess). Plus the vector store (Pinecone, Weaviate, pgvector) has its own monthly bill. Budget $200 to $2k per month for a serious deployment.

3. Evaluation

If you are doing it right, you are running evaluations on every change. LLM-as-judge calls add up fast. A 200-question eval suite running on every pull request can hit $50 to $200 per run. We typically see clients spend more on evaluations than on production traffic in the first six months.
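To see how a per-run figure in that range can arise, here is a minimal sketch of the arithmetic. It assumes two model calls per question (one to produce the answer, one for the judge), hypothetical call sizes of 4,000 input and 500 output tokens, top-tier pricing from the range quoted earlier, and 20 pull requests a month. All of those are illustrative assumptions.

```python
def eval_run_cost(questions, calls_per_question, input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """Dollar cost of one full pass over an eval suite."""
    # Cost of a single model call, scaled by one million.
    per_call_scaled = (input_tokens * input_price_per_m
                       + output_tokens * output_price_per_m)
    return questions * calls_per_question * per_call_scaled / 1_000_000


# Hypothetical suite: 200 questions, an answer call plus a judge call each,
# 4,000 input / 500 output tokens per call, $30 and $120 per million tokens.
run_cost = eval_run_cost(200, 2, 4_000, 500, 30, 120)
runs_per_month = 20  # assumed: one eval run per pull request

print(f"Per run: ${run_cost:,.2f}")                      # Per run: $72.00
print(f"Per month: ${run_cost * runs_per_month:,.2f}")   # Per month: $1,440.00
```

Under these assumptions the spend scales linearly with pull-request volume, which is why running a smaller smoke-test subset on every commit and the full suite less often is a common way to cap it.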

4. Observability

Datadog, Langfuse, Helicone, or homegrown. You need observability one way or another. Without it you are flying blind, and the first time you find a quality regression in production, you will wish you had paid the bill. Budget $200 to $1k per month.

5. The "human-in-the-loop" team

Almost every production AI system needs humans labeling failures, refining prompts, and curating the evaluation set. That is headcount, not infrastructure, but it is a real cost line and the one most teams forget.

A typical year-one budget

For a mid-size deployment serving ~100k requests/day:

Category                   Monthly cost
Inference (GPT-4-class)    $3k–$12k
Embeddings + vector DB     $400–$2k
Evaluation                 $500–$3k
Observability + logging    $300–$1k
Engineering ops            $5k–$15k
Total                      $9k–$33k/month

Before you flinch, remember: that is typically replacing $50k to $200k per month of human work. The return on investment is real. Just go in with eyes open.
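The totals and the comparison above are easy to re-derive, which helps when you substitute your own numbers. This sketch copies the ranges from the table and the $50k to $200k figure for replaced human work. The variable names are illustrative.

```python
# Monthly ranges in dollars, copied from the budget table: (low, high).
budget = {
    "Inference (GPT-4-class)": (3_000, 12_000),
    "Embeddings + vector DB": (400, 2_000),
    "Evaluation": (500, 3_000),
    "Observability + logging": (300, 1_000),
    "Engineering ops": (5_000, 15_000),
}
# Range quoted for the human work being replaced, per month.
human_work = (50_000, 200_000)

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())

print(f"Monthly total: ${low:,} to ${high:,}")            # $9,200 to $33,000
print(f"Year one:      ${low * 12:,} to ${high * 12:,}")  # $110,400 to $396,000

# Most pessimistic pairing: highest stack cost against the least work replaced.
worst_case_saving = human_work[0] - high
# Most optimistic pairing: lowest stack cost against the most work replaced.
best_case_saving = human_work[1] - low
print(f"Net monthly saving: ${worst_case_saving:,} to ${best_case_saving:,}")
```

Even the pessimistic pairing stays positive here ($17,000 a month), but that margin is thin enough that a smaller deployment replacing less human work could easily land underwater.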

How we'd phase it

Months 1–3: Frontier APIs, lightweight evals, minimum observability. Prove the value.
Months 4–9: Add proper evals + observability. Start measuring failure modes seriously.
Months 10+: Consider self-hosting open-weights models for the high-volume paths. Distillation if you can pull it off.

The mistake we see most often is teams jumping to the third phase before they've validated the first. Don't optimize before you have the data to optimize against.
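One way to decide when the data justifies the move to self-hosting is a simple break-even check. The sketch below is a simplification under stated assumptions: it treats the "3–10× cheaper" figure as applying to variable per-token cost only, and treats GPU rental plus ops as a fixed monthly cost (the $6,000 used here is hypothetical). Setting API spend equal to fixed cost plus API spend divided by the factor gives the threshold.

```python
def breakeven_api_spend(fixed_monthly_cost, cheaper_factor):
    """Monthly API spend above which self-hosting comes out ahead.

    Assumes self-hosted variable cost is api_spend / cheaper_factor and that
    GPU rental and ops form a fixed monthly cost. Solving
        api_spend = fixed + api_spend / factor
    gives api_spend = fixed * factor / (factor - 1).
    """
    if cheaper_factor <= 1:
        raise ValueError("self-hosting must be cheaper per token to break even")
    return fixed_monthly_cost * cheaper_factor / (cheaper_factor - 1)


# Hypothetical fixed cost of $6,000/month for GPU rental and ops.
for factor in (3, 10):
    threshold = breakeven_api_spend(6_000, factor)
    print(f"{factor}x cheaper: self-hosting pays off above ${threshold:,.0f}/month")
```

The threshold only applies to the traffic you would actually move, so measure API spend per path before comparing it against this number.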
