Kapa Indexes Images as Text to Cut RAG Query Costs by Half

TL;DR

Process images once at indexing time with a cheap vision model, store the results as text descriptions, and retrieve them alongside ordinary text chunks. This reduces per-query overhead by 94-99% while improving answer quality.

Key Points

Query-time multimodal vision adds 27-51% to the cost of each query; captions generated at indexing time add only 1-6%
Storing captions as separate chunks beats inline storage: with GPT, cost rises 6% versus 19% inline; Claude sees a slight cost reduction
Small vision models (GPT-4 mini) produce captions of near-identical quality to those from models 4x more expensive
Images were placed correctly in context 94-99% of the time across three production projects
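
The separate-caption-chunk idea can be illustrated with a minimal sketch. This is not Kapa's implementation: the `Chunk` fields, the `index_document` and `retrieve` functions, and the `caption_image` callback are all hypothetical names chosen for illustration, and the keyword-overlap ranking is a toy stand-in for a real retriever. The point it shows is that the vision model is called once per image at indexing time, and queries afterwards touch only text.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    kind: str       # "text" or "image_caption"
    position: int   # order of the block within its source document
    content: str


def index_document(
    doc_id: str,
    blocks: List[Tuple[str, object]],
    caption_image: Callable[[object], str],
) -> List[Chunk]:
    """Turn (kind, payload) blocks into chunks, captioning each image once.

    `caption_image` is a hypothetical hook for whatever vision model is used.
    Each caption becomes its own chunk instead of being merged into text.
    """
    chunks = []
    for position, (kind, payload) in enumerate(blocks):
        if kind == "image":
            chunks.append(Chunk(doc_id, "image_caption", position, caption_image(payload)))
        else:
            chunks.append(Chunk(doc_id, "text", position, str(payload)))
    return chunks


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(chunks: List[Chunk], query: str, k: int = 3) -> List[Chunk]:
    """Toy ranking by keyword overlap; caption chunks compete like any text."""
    terms = _tokens(query)
    ranked = sorted(chunks, key=lambda c: len(terms & _tokens(c.content)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    vision_calls = []

    def stub_captioner(image: object) -> str:
        # Stand-in for a real vision model call; records each invocation.
        vision_calls.append(image)
        return "Diagram: the client calls the API gateway, which forwards to the index service"

    doc = [("text", "Install the CLI and run the setup command."), ("image", b"<png bytes>")]
    chunks = index_document("guide", doc, stub_captioner)
    top = retrieve(chunks, "api gateway diagram", k=1)[0]
    print(top.kind, "| vision calls so far:", len(vision_calls))
```

Keeping `position` on every chunk is one way a pipeline could later put a retrieved caption back next to the text it came from; how placement is actually done in the source system is not described here.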

Why It Matters

For RAG systems serving millions of queries, this inverts the economics of multimodal AI: paying once to understand each image beats paying vision costs on every query. The approach works because most images in technical docs either clarify existing text or carry load-bearing data such as tables and diagrams. Both cases benefit from transcription at ingestion rather than repeated pixel inspection.
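
To make the economics concrete, here is a small arithmetic sketch. The overhead percentages are the ones quoted in the Key Points; the baseline cost per query and the query volume are hypothetical values picked only for illustration, and the one-time cost of generating captions at indexing is left out because the summary does not quantify it.

```python
def added_cost(baseline_per_query: float, overhead: float, n_queries: int) -> float:
    """Extra spend when every query costs `overhead` (a fraction) more."""
    return baseline_per_query * overhead * n_queries


# Hypothetical workload; neither number comes from the source.
BASELINE_PER_QUERY = 0.01   # assumed USD cost of a text-only query
N_QUERIES = 1_000_000       # assumed query volume

# Overhead ranges quoted in the Key Points (as fractions).
QUERY_TIME_VISION = (0.27, 0.51)
INDEX_TIME_CAPTIONS = (0.01, 0.06)


def cost_range(overheads):
    """Return the (low, high) added cost for a (low, high) overhead range."""
    low, high = overheads
    return (
        added_cost(BASELINE_PER_QUERY, low, N_QUERIES),
        added_cost(BASELINE_PER_QUERY, high, N_QUERIES),
    )


if __name__ == "__main__":
    for label, overheads in [
        ("query-time vision", QUERY_TIME_VISION),
        ("indexing-time captions", INDEX_TIME_CAPTIONS),
    ]:
        low, high = cost_range(overheads)
        print(f"{label}: ${low:,.0f} to ${high:,.0f} extra")
```

Because both overheads scale with query volume, the gap between the two approaches grows linearly with traffic under these assumptions.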

Source: www.kapa.ai