RAG (Retrieval-Augmented Generation), introduced in a 2020 Facebook AI paper by Lewis et al., is a pattern in which the system first retrieves passages relevant to the user's question from an external store, then injects them into the LLM's context before it generates an answer. The approach unlocks fresh or proprietary data without retraining the model, reduces Hallucination, and makes Citation feasible. A typical pipeline splits documents via Chunking, turns the chunks into Embeddings, and writes them to a Vector Database; at query time the most relevant chunks are pulled back with Hybrid Search or pure dense retrieval. RAG has become the dominant baseline architecture for enterprise AI applications.
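To make the pipeline concrete, here is a minimal sketch of the index-then-retrieve loop. It is an illustration under assumptions, not a reference implementation: `embed` is a toy deterministic stand-in for a real embedding model, `VectorStore` is an in-memory stand-in for a real Vector Database, and Hybrid Search is omitted in favor of pure dense retrieval.

```python
# Minimal, self-contained sketch of the RAG pipeline described above.
# `embed` and `VectorStore` are illustrative stand-ins, not any specific
# library's API.
import hashlib
import numpy as np

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Naive Chunking: split a document into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash the text into a seed, draw a unit vector.
    A real pipeline would call an embedding model here instead."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)

class VectorStore:
    """In-memory stand-in for a Vector Database (pure dense retrieval)."""
    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.matrix = np.empty((0, 64))

    def add(self, chunks: list[str]) -> None:
        self.chunks.extend(chunks)
        self.matrix = np.stack([embed(c) for c in self.chunks])

    def search(self, query: str, k: int = 3) -> list[str]:
        # Cosine similarity reduces to a dot product on unit vectors.
        scores = self.matrix @ embed(query)
        top = np.argsort(scores)[::-1][:k]
        return [self.chunks[i] for i in top]

# Index time: chunk -> embed -> store.
store = VectorStore()
store.add(chunk("Refunds are issued within 14 days of purchase..."))

# Query time: retrieve the top chunks, then inject them into the prompt.
question = "What is the refund policy?"
context = "\n\n".join(store.search(question))
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# `prompt` would now be sent to the LLM for generation.
```

In a production system the toy pieces would be replaced by a real embedding model and a persistent vector store, and the retrieval step might combine the dense scores with keyword (e.g. BM25) scores to form Hybrid Search.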