TL;DR
Process images once at indexing time with cheap vision models, store the results as text descriptions, and retrieve them alongside text chunks. This reduces per-query overhead by 94-99% while improving answer quality.
Key Points
- Query-time multimodal vision adds 27-51% to the cost of each query; indexing-time captions add only 1-6% overhead
- Separate caption chunks beat inline storage: with GPT they raise cost by 6% versus 19% for inline captions; Claude sees a slight cost reduction
- Small vision models (GPT-4 mini) produce near-identical caption quality to models 4x more expensive
- Images placed correctly in context 94-99% of the time across three production projects
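
The indexing-time approach in the points above can be sketched roughly as follows. This is a minimal illustration and not the source's implementation: `Chunk`, `index_document`, `retrieve`, and the `caption_image` callback are hypothetical names, the captioner is passed in as a plain function so that any small vision model could sit behind it, and the keyword-overlap scoring is a toy stand-in for a real retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# A document is modeled here as an ordered list of ("text", str) or
# ("image", bytes) blocks. This layout is an assumption for illustration.
Block = Tuple[str, Union[str, bytes]]


@dataclass
class Chunk:
    doc_id: str
    kind: str       # "text" or "caption"
    text: str
    position: int   # order of the block within its source document


def index_document(
    doc_id: str,
    blocks: List[Block],
    caption_image: Callable[[bytes], str],
) -> List[Chunk]:
    """Caption each image once at indexing time and store the caption
    as its own chunk, separate from the surrounding text chunks."""
    chunks = []
    for position, (kind, payload) in enumerate(blocks):
        if kind == "image":
            # The only place the (hypothetical) vision model is called.
            chunks.append(Chunk(doc_id, "caption", caption_image(payload), position))
        else:
            chunks.append(Chunk(doc_id, "text", str(payload), position))
    return chunks


def retrieve(chunks: List[Chunk], query: str, k: int = 3) -> List[Chunk]:
    """Toy retriever: rank chunks by word overlap with the query.
    Captions and text are scored the same way; no image is re-read."""
    terms = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Because `retrieve` only ever touches stored text, the vision model is called once per image during `index_document`, however many queries follow. Keeping `position` on each chunk is one way to put a caption back next to its neighboring text when assembling context.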
Why It Matters
For RAG systems serving millions of queries, this inverts the economics of multimodal AI: paying once at ingestion to understand images beats paying vision costs on every query. The approach works because most images in technical docs either clarify existing text or contain load-bearing data such as tables and diagrams. Both cases benefit from transcription at ingestion rather than repeated pixel inspection.
Source: www.kapa.ai