Per-token pricing is the tip of the iceberg. Three months into a serious LLM deployment, the per-token bill is rarely the biggest line item. Here's the honest breakdown.
The five costs nobody warns you about
1. Inference (yes, the obvious one)
Prices for frontier models from OpenAI and Anthropic have come down dramatically. For most use cases you're looking at $5–$30 per million input tokens and $15–$120 per million output tokens, depending on the tier. Open-weights models you self-host can be 3–10× cheaper at scale, but only after you factor in GPU rental.
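To turn those per-million prices into a monthly figure, multiply them through by your traffic. Here is a minimal sketch. The function name and the workload numbers (300 input and 100 output tokens per request) are our illustrative assumptions, not measurements, so plug in your own.

```python
def monthly_inference_cost(requests_per_day, input_tokens, output_tokens,
                           input_price_per_m, output_price_per_m, days=30):
    """Estimated monthly inference spend in dollars.

    Prices are dollars per million tokens; token counts are per request.
    """
    # Dollar cost of one request, scaled by one million to keep the math exact.
    cost_per_request_x1m = (input_tokens * input_price_per_m
                            + output_tokens * output_price_per_m)
    return requests_per_day * days * cost_per_request_x1m / 1_000_000


# Hypothetical workload: 100k requests/day, short prompts, low-end pricing
# ($5 per million input tokens, $15 per million output tokens).
estimate = monthly_inference_cost(100_000, 300, 100, 5.0, 15.0)
print(f"${estimate:,.0f} per month")
```

Note how sensitive the result is to prompt length: doubling the input tokens in this sketch adds half again to the bill, which is why long system prompts and stuffed RAG contexts show up on the invoice.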
2. Embeddings + vector storage
Every RAG system needs embeddings, and you will re-embed every time your chunking strategy changes (which is more often than you would guess). Plus the vector store (Pinecone, Weaviate, pgvector) has its own monthly bill. Budget $200 to $2k per month for a serious deployment.
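Re-embedding is easy to underestimate because it scales with the whole corpus, not with traffic. A rough sketch of the arithmetic follows. The corpus size, chunk length, re-embed count, and the per-million-token embedding price are all placeholder assumptions, so check your provider's current rate.

```python
def reembedding_cost(num_chunks, avg_tokens_per_chunk,
                     price_per_m_tokens, reembeds=1):
    """Dollars spent embedding the full corpus `reembeds` times."""
    total_tokens = num_chunks * avg_tokens_per_chunk * reembeds
    return total_tokens * price_per_m_tokens / 1_000_000


# Hypothetical corpus: 2M chunks of ~500 tokens, re-embedded 4 times in a
# quarter as the chunking strategy changes, at an assumed $0.10 per million.
cost = reembedding_cost(2_000_000, 500, 0.10, reembeds=4)
print(f"${cost:,.2f} in embedding calls")
```

This covers only the embedding calls. The vector store's own monthly bill comes on top.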
3. Evaluation
If you are doing it right, you are running evaluations on every change. LLM-as-judge calls add up fast. A 200-question eval suite running on every pull request can hit $50 to $200 per run. We typically see clients spend more on evaluations than on production traffic in the first six months.
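The per-run figure only tells half the story, because the monthly bill depends on how often the suite runs. A small sketch of that multiplication is below. The per-question costs are derived from the $50 to $200 range above ($0.25 to $1.00 per question for 200 questions), while the 10 runs per month is purely an assumption for illustration.

```python
def eval_cost_per_run(num_questions, cost_per_question):
    """Dollars for one full pass of the eval suite."""
    return num_questions * cost_per_question


def monthly_eval_cost(num_questions, cost_per_question, runs_per_month):
    """Dollars per month if the suite runs `runs_per_month` times."""
    return eval_cost_per_run(num_questions, cost_per_question) * runs_per_month


# 200 questions at $0.25 and $1.00 each, assuming 10 runs per month.
low = monthly_eval_cost(200, 0.25, runs_per_month=10)
high = monthly_eval_cost(200, 1.00, runs_per_month=10)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

A team that merges several pull requests a day will blow well past this, which is one reason to run a smaller smoke-test subset on every PR and the full suite less often.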
4. Observability
Datadog, Langfuse, Helicone, or homegrown. You need observability one way or another. Without it you are flying blind, and the first time you find a quality regression in production, you will wish you had paid the bill. Budget $200 to $1k per month.
5. The "human-in-the-loop" team
Almost every production AI system needs humans labeling failures, refining prompts, and curating the evaluation set. That is headcount, not infrastructure, but it is a real cost line and the one most teams forget.
A typical year-one budget
For a mid-size deployment serving ~100k requests/day:
| Category | Monthly cost |
|---|---|
| Inference (GPT-4-class) | $3k–$12k |
| Embeddings + vector DB | $400–$2k |
| Evaluation | $500–$3k |
| Observability + logging | $300–$1k |
| Engineering ops | $5k–$15k |
| Total | $9k–$33k/month |
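If you want to adapt the table to your own numbers, it helps to keep it as data and let the totals fall out. The sketch below simply re-adds the ranges from the table above. The dictionary layout and variable names are our own choice.

```python
# (low, high) monthly cost in dollars, copied from the table above.
budget = {
    "Inference (GPT-4-class)": (3_000, 12_000),
    "Embeddings + vector DB": (400, 2_000),
    "Evaluation": (500, 3_000),
    "Observability + logging": (300, 1_000),
    "Engineering ops": (5_000, 15_000),
}

total_low = sum(low for low, _ in budget.values())
total_high = sum(high for _, high in budget.values())

# Largest line item at the low end of each range.
largest = max(budget, key=lambda name: budget[name][0])

print(f"Total: ${total_low:,} to ${total_high:,} per month")
print(f"Largest category: {largest}")
```

The low end sums to $9,200, which the table rounds to $9k. Engineering ops, not inference, is the largest category at both ends of the range.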
Before you flinch, remember: that is typically replacing $50k to $200k per month of human work. The return on investment is real. Just go in with eyes open.
How we'd phase it
- Months 1–3: Frontier APIs, lightweight evals, minimum observability. Prove the value.
- Months 4–9: Add proper evals + observability. Start measuring failure modes seriously.
- Months 10+: Consider self-hosting open-weights models for the high-volume paths. Distillation if you can pull it off.
The mistake we see most often is teams jumping to the third phase (self-hosting and distillation) before they've validated the first. Don't optimize before you have the data to optimize against.