A free embedding model just changed your RAG cost math

IBM's Granite Embedding R2 is Apache-2.0, sub-100M parameters, 32K context — and it runs on hardware you already own.

May 15, 2026 · 5 min read · Domani AI

IBM shipped a multilingual embedding model this month that fits in under 100M parameters, handles 32K-token context windows, and carries an Apache-2.0 license. If your RAG pipeline is still calling a commercial embeddings API for every document chunk and every query, the cost floor on that layer just dropped to near zero. The question for this week is not whether to evaluate it — it is how fast you can run the benchmark.

What changed in the Granite Embedding R2 release

IBM released Granite Embedding Multilingual R2 on Hugging Face under the Apache-2.0 license. The model sits below 100M parameters, a weight class that runs comfortably on a single A10 or L4 GPU, the kind already provisioned in most mid-market cloud accounts. The headline capability is a 32K-token context window, far longer than the 512-token limit that has forced chunking strategies on most production RAG systems. Multilingual support is built in, not bolted on, covering enough languages to matter for any company with European or APAC document sets.

The model was benchmarked against other sub-100M retrieval models and posts leading scores in that weight class. Apache-2.0 means no field-of-use restrictions, no per-token fees, no data-leaving-your-infra clauses to negotiate, and a short license review before you ship to a regulated customer; the main obligation is to keep the license and attribution notices when you redistribute. IBM positions this as the retrieval layer of its broader Granite 4.0 family, so updates are likely to follow a product roadmap rather than a research calendar.

Why this changes the math on embedding API dependency

Most RAG stacks built in 2023–2024 default to a commercial embeddings endpoint. The reasons were sensible at the time: fast to integrate, no GPU to manage, good baseline quality. The problem is that embeddings are billed at two points in the pipeline: once at index time for every document chunk, and again at query time for every user request. At scale, that is a predictable cost that grows with both corpus size and traffic, with no natural ceiling.

The 32K context window changes more than just cost. Short context limits forced engineers into chunking pipelines that introduce retrieval noise: a contract clause split across two chunks, a support ticket that loses its header, a policy document that answers the question only when read in full. With 32K context, any document that fits in the window can be embedded as a single unit. That simplifies the ingestion pipeline, reduces the surface area for retrieval errors, and makes re-ranking logic easier to reason about. Having fewer moving parts is a risk reduction, not just a convenience.
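A quick way to gauge how much chunking the longer window would actually remove is to count which documents fit in a single embedding call. The sketch below is illustrative only: the 32,768-token limit, the 1.3 tokens-per-word ratio, and the function names are all assumptions, and real numbers should come from the model's own tokenizer and model card.

```python
# Sketch: split a corpus into documents that fit one embedding call
# and documents that would still need chunking.

MAX_CONTEXT_TOKENS = 32_768  # assumption: check the model card for the exact limit
TOKENS_PER_WORD = 1.3        # assumption: crude ratio, varies by language and tokenizer


def approx_token_count(text: str) -> int:
    """Estimate token count from whitespace-separated words."""
    return int(len(text.split()) * TOKENS_PER_WORD)


def plan_ingestion(docs: dict[str, str], limit: int = MAX_CONTEXT_TOKENS) -> dict[str, list[str]]:
    """Group document IDs by whether the estimated token count fits the limit."""
    plan: dict[str, list[str]] = {"embed_whole": [], "needs_chunking": []}
    for doc_id, text in docs.items():
        key = "embed_whole" if approx_token_count(text) <= limit else "needs_chunking"
        plan[key].append(doc_id)
    return plan


if __name__ == "__main__":
    sample = {"contract": "clause " * 2_000, "handbook": "policy " * 40_000}
    print(plan_ingestion(sample))
```

If most of the corpus lands in the first group, the chunking stage can shrink to a fallback path for outliers.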

The multilingual capability matters for a specific customer segment that rarely gets addressed directly: companies running internal tools across regions, or SaaS vendors selling into non-English markets. Running a separate embedding model per language and accepting degraded quality from an English-primary model are the two usual compromises; a single multilingual model avoids both. One model, one deployment, one cost center.

What a CTO should do this Monday morning

The move this week is a contained benchmark, not a full migration. Scope it to 3 days of engineering time and produce a single output: a cost delta number over 12 months, using your actual embeddings volume.

Here is the calculation frame:

  • Pull your embeddings API bill for the last 30 days. Multiply by 12 for an annualized baseline.
  • Estimate self-hosted cost: a single L4 GPU on Google Cloud runs roughly $0.80–$1.20/hour on-demand (about $580–$880/month if it runs around the clock), less on committed use. For most pipelines under 10M tokens/day, one GPU handles the load with headroom.
  • Run Granite R2 on a 1,000-document sample of your actual corpus. Compare retrieval quality against your current embeddings using your existing eval set. If you do not have an eval set, build a 50-question golden set this week — that work is overdue regardless.
  • Check your license posture — Apache-2.0 means no legal review for most enterprise contexts, but confirm with your counsel if you operate in a regulated sector.
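The calculation frame above reduces to a few lines of arithmetic. This is a minimal sketch, not a full cost model: the function name and parameters are hypothetical, it assumes the GPU runs around the clock at a flat hourly rate, and it ignores storage, egress, and engineering time unless you fold them into the setup figure.

```python
# Sketch: 12-month cost delta between an embeddings API and a self-hosted GPU.

HOURS_PER_MONTH = 730  # average hours in a calendar month


def annual_cost_delta(
    api_bill_last_30_days: float,
    gpu_hourly_rate: float,
    gpu_count: int = 1,
    one_time_setup_cost: float = 0.0,
) -> dict[str, float]:
    """Annualize both options and return the difference (positive means self-hosting is cheaper)."""
    api_annual = api_bill_last_30_days * 12
    self_hosted_annual = (
        gpu_hourly_rate * HOURS_PER_MONTH * 12 * gpu_count + one_time_setup_cost
    )
    return {
        "api_annual": api_annual,
        "self_hosted_annual": self_hosted_annual,
        "delta": api_annual - self_hosted_annual,
    }


if __name__ == "__main__":
    # Illustrative inputs only: replace with your own bill and quoted GPU rate.
    for label, value in annual_cost_delta(1_500.0, 0.80).items():
        print(f"{label}: ${value:,.0f}")
```

A negative delta is a useful result too: it means cost is not the argument for migrating, and the decision rests on data residency or context length.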

The migration itself is low-risk. Embedding models sit at the ingestion and query layer, behind your vector store. Swapping the model requires re-indexing your corpus (a one-time operation) and updating the query embedding call. It does not touch your LLM, your prompt logic, or your application layer. A competent engineer can run a parallel index and A/B the retrieval quality before you decommission the API dependency entirely.
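The parallel-index A/B can be sketched in plain Python. Everything here is an assumption for illustration: the in-memory dictionary stands in for your vector store, `keyword_embedder` is a toy stand-in for both the commercial API and the self-hosted model, and recall@k is one reasonable metric among several. In a real run, each embedder would be a thin wrapper around the actual model call.

```python
# Sketch: compare two embedders on the same corpus and golden set using recall@k.
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str], embed: Embedder) -> dict[str, Sequence[float]]:
    """Embed every document once; stands in for a real vector store."""
    return {doc_id: embed(text) for doc_id, text in docs.items()}


def recall_at_k(
    index: dict[str, Sequence[float]],
    embed: Embedder,
    golden: list[tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected document is in the top k results."""
    hits = 0
    for question, expected_id in golden:
        query_vec = embed(question)
        ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(golden)


def keyword_embedder(vocabulary: list[str]) -> Embedder:
    """Toy stand-in for a real model: counts vocabulary words in the text."""
    def embed(text: str) -> list[float]:
        words = text.lower().split()
        return [float(words.count(term)) for term in vocabulary]
    return embed


if __name__ == "__main__":
    docs = {"d1": "refund policy for orders", "d2": "gpu autoscale policy"}
    golden = [("refund for orders", "d1"), ("autoscale gpu", "d2")]
    embedders = {
        "current": keyword_embedder(["refund", "policy"]),
        "candidate": keyword_embedder(["refund", "gpu", "autoscale", "orders"]),
    }
    for name, embed in embedders.items():
        index = build_index(docs, embed)
        print(name, recall_at_k(index, embed, golden, k=1))
```

Each embedder gets its own index because vectors from different models are not comparable; the query must always be embedded by the same model that built the index it searches.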

What it costs — and what it realistically saves

The honest trade-off is operational ownership. A commercial API gives you zero infrastructure surface area. Self-hosting Granite R2 means you own the deployment, the scaling logic, and the uptime SLA. For a team already running GPU workloads — which most AI-forward companies at 50–500 FTE are — this is an incremental burden, not a new capability. For a team with no GPU infrastructure today, the calculus is different: factor in setup time (estimate 2–4 days for a containerized deployment with a health check and autoscale policy) and ongoing ops overhead before you commit.

On the savings side: companies embedding more than 500M tokens per month against a commercial API are typically spending $500–$2,000/month on that line item alone, depending on the provider and tier. Self-hosted on a single reserved GPU brings that to roughly $400–$600/month in compute, so the saving ranges from negligible at the low end of that band to about $1,500/month at the high end. Quality should be at least equal to commercial sub-100M alternatives, but confirm that on your own eval set. Above 2B tokens/month, the delta is significant enough to justify a dedicated embeddings service with redundancy. Below 50M tokens/month, at proportional rates the API bill falls below the cost of a dedicated GPU, so the stronger argument is data residency and context length, not cost. Know which argument applies to your situation before you schedule the migration.
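To find which side of the line you are on, compute the monthly token volume at which a fixed GPU cost matches a per-token API price. This is a sketch under stated assumptions: the function name is hypothetical, and the $1 per million tokens in the example is only the rate implied by the low end of the figures above ($500 for 500M tokens), not a quoted provider price.

```python
# Sketch: monthly token volume where self-hosting and a per-token API cost the same.


def break_even_tokens_per_month(
    api_price_per_million_tokens: float,
    self_hosted_monthly_cost: float,
) -> float:
    """Return the token volume above which the self-hosted option is cheaper."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price per million tokens must be positive")
    return self_hosted_monthly_cost / api_price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Illustrative inputs: $1 per 1M tokens (implied, not quoted) vs a $500/month GPU.
    volume = break_even_tokens_per_month(1.0, 500.0)
    print(f"Break-even at {volume:,.0f} tokens/month")
```

Plug in your provider's actual per-token price; a cheaper API tier pushes the break-even volume up in direct proportion.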
