What? Another RAG article…
Retrieval Augmented Generation [RAG] is probably the most common architecture in practice for building generative AI applications. Numerous design patterns for RAG have evolved, spanning indexing techniques, graph databases, retrieval techniques, agentic workflows, RAFT [retrieval augmented fine tuning] and more. This hyper-dynamic space demands that you design a modular, configurable architecture for a production-grade RAG Gen AI application.
There are probably many RAG articles and blogs already, so why another one? Our focus here is to dive deeper into the different components of RAG and the considerations for configuring them for a production workload. We will also dive a little deeper into some of the options AWS provides that make it easier to build configurable RAG applications.
For the sake of simplicity, we will divide the entire RAG application into 3 pipelines: an Indexing pipeline, a Retrieval pipeline and a Generation pipeline. The Indexing pipeline focusses on building a knowledge base from your systems of record, the Retrieval pipeline focusses on retrieving the information relevant to the query, and the Generation pipeline focusses on generating the output.
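One way to keep the three pipelines modular and configurable is to give each its own configuration object and validate them together. The following is a minimal sketch, not a prescribed design: every class name, field name and default value here is an illustrative assumption.

```python
from dataclasses import dataclass, field


@dataclass
class IndexingConfig:
    chunk_size: int = 512      # illustrative default, counted in words or tokens
    chunk_overlap: int = 64    # illustrative default
    embedding_model: str = "example-embedding-model"  # hypothetical model name


@dataclass
class RetrievalConfig:
    search_type: str = "hybrid"  # "vector" or "hybrid"
    top_k: int = 10              # candidates fetched from the knowledge base
    rerank_top_n: int = 3        # chunks passed on to the generation pipeline


@dataclass
class GenerationConfig:
    llm: str = "example-llm"     # hypothetical model name
    cache_enabled: bool = True
    response_format: str = "json"


@dataclass
class RagConfig:
    indexing: IndexingConfig = field(default_factory=IndexingConfig)
    retrieval: RetrievalConfig = field(default_factory=RetrievalConfig)
    generation: GenerationConfig = field(default_factory=GenerationConfig)

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means the config is usable."""
        problems = []
        if not 0 <= self.indexing.chunk_overlap < self.indexing.chunk_size:
            problems.append("chunk_overlap must be >= 0 and smaller than chunk_size")
        if self.retrieval.search_type not in {"vector", "hybrid"}:
            problems.append("search_type must be 'vector' or 'hybrid'")
        if not 0 < self.retrieval.rerank_top_n <= self.retrieval.top_k:
            problems.append("rerank_top_n must be between 1 and top_k")
        return problems
```

Swapping an embedding model or a search type then becomes a configuration change rather than a code change, and `RagConfig().validate()` can run as a startup check.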
Indexing Pipeline:
Building knowledge bases is fundamental to RAG. A knowledge base is built by ingesting data from your systems of record, vectorizing the data and storing it in a vector database. The indexing pipeline enables the creation of these vector databases. It is important to understand the steps of the pipeline:
- Connectors: Your enterprise domain data could reside in different types of Systems of Record [SOR]. SORs include, but are not limited to, object stores like Amazon S3 and ADLS, web sites, data lakes, databases and SaaS-based applications. You will have to use specialized connectors to connect to these SORs and extract information.
- Loaders: Data within these SORs could be in multiple formats, ranging from unstructured data in PDFs, Word documents, PPTs and images (PNG, TIFF), to semi-structured data in JSON and XML, to structured data in databases. You will have to configure specialized loaders for each of the document types.
- Processors: A single document can contain diverse content. For example, a Word document or a PDF can have text, tables and images in it. This requires you to think about processing strategies and algorithms to extract information from each.
- Securers: Often the data in your SORs contains PII, and you will have to devise a mechanism to identify, redact or anonymize it before sending it for further downstream processing.
- Splitters and Chunkers: Splitting and chunking strategies are important for large documents. There is no exact science to chunk sizes and chunk overlaps. You will have to find the right configuration by trial and error to ensure no loss of knowledge and context.
- Enrichers: Metadata enrichment and tagging of documents is a key step in the indexing pipeline. This enrichment provides the filters used in the retrieval pipeline.
- Embedders: Embedding models range from cloud-hosted to self-hosted open source models, from sparse vector to dense vector models, and from out-of-the-box to fine-tuned models. There is no one-size-fits-all embedding model. You will also need to consider the vector dimensions these models produce, to strike a balance between data size and semantic accuracy.
- Optimizers: Dimensionality reduction and quantization are some of the techniques for storing your vectors compactly in the vector database.
- Indexers: Efficient storage and retrieval depend on the choice of index and engine. HNSW with the faiss engine has been the de facto choice for most implementations. However, you also have tree-based and hash-based indexes, and other engines such as nmslib and lucene.
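To make the splitter and enricher steps concrete, here is a minimal sketch of word-based chunking with overlap, followed by attaching metadata to each chunk. It assumes whitespace-separated words as the unit of size; production splitters usually count tokens and respect sentence or section boundaries. The function names, the record fields and the `source#index` id scheme are illustrative assumptions.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


def enrich_chunks(chunks: list[str], source: str, tags: list[str]) -> list[dict]:
    """Attach metadata to each chunk so it can be used as a filter at retrieval time."""
    return [
        {"chunk_id": f"{source}#{i}", "text": chunk, "source": source, "tags": list(tags)}
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for record in enrich_chunks(chunk_words(sample, chunk_size=4, overlap=1), "doc.pdf", ["hr"]):
        print(record)
```

Because `chunk_size` and `overlap` are plain parameters, the trial-and-error tuning described above becomes a matter of re-running the indexing step with different values and comparing retrieval quality.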
Knowledge Base:
Deciding on the best-fit knowledge base is in itself worthy of a separate blog post. Here are some of the considerations.
- Purpose built vs. Extended with Vector capabilities
- Managed vs. SaaS vs. Self hosted vs. Serverless, Open vs. Closed source
- Index support — What type of Indexes does it support
- Type of compression supported, ranking algorithms supported
- What is the Ingestion and Search speed, Recall / latency trade off
- Does it support in-memory storage, on-disk storage, or both
- What type of search does it support — Hybrid / vector /
- Cost of Operations, Security, Integration MLOps
The above is by no means a complete list. In the following weeks we will address the Vector DB selection matrix.
Retrieval Pipeline
The retrieval pipeline starts with the prompt being sent to the vector database to search for and retrieve relevant document chunks. The prompt needs to be vectorized using the same embedding model used in the indexing pipeline.
- Retrievers: You can do vector search or hybrid search to retrieve the most relevant document chunks. In the case of hybrid search, you will need to combine the lexical and vector results into a single ranking, for example with a cross-encoder that re-scores the merged candidates.
- Filters: Remember the metadata enrichment done in the indexing pipeline. That metadata can be used to filter out document chunks that are not relevant. This can be done as a pre-retrieval or post-retrieval technique.
- Re-rankers: Once you have retrieved the relevant documents, you will have to re-rank them and send only the best-fit document chunks for generation, depending on how much data you want to send to your generation pipeline.
- Optimizers: Another pattern to consider is optimizing the retrieved data. There are techniques now available to filter irrelevant and repeated information out of the retrieved chunks. This also helps reduce the amount of data sent to the generation pipeline.
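As an illustration of the retriever and filter steps, the sketch below merges a lexical ranking and a vector ranking and then applies a post-retrieval metadata filter. It uses reciprocal rank fusion (RRF), one common way to merge ranked lists, chosen here because it needs no model; a cross-encoder re-ranker would replace or follow this step. The function names, the sample ids and the constant `k=60` are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more score the higher it ranks in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


def filter_by_metadata(chunk_ids: list[str], metadata: dict[str, dict],
                       required: dict[str, str]) -> list[str]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk_id for chunk_id in chunk_ids
        if all(metadata.get(chunk_id, {}).get(key) == value
               for key, value in required.items())
    ]


if __name__ == "__main__":
    lexical_hits = ["a", "b", "c"]  # e.g. from a keyword query
    vector_hits = ["b", "d", "a"]   # e.g. from a k-NN query
    fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
    metadata = {"a": {"dept": "hr"}, "b": {"dept": "finance"}, "d": {"dept": "hr"}}
    print(fused)
    print(filter_by_metadata(fused, metadata, {"dept": "hr"}))
```

Chunks that appear near the top of both lists rise to the top of the fused list, which is the behaviour you want from hybrid search. The same filter could instead be pushed into the query itself as a pre-retrieval filter when the knowledge base supports it.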
Generation Pipeline
The generation pipeline focusses on generating the output for the user's prompt. Optimizations in this pipeline can reduce the number of LLM or embedding calls.
- Cachers: One such optimization is caching. Caching the request / response in an in-memory database like Amazon MemoryDB for Redis can help in answering repetitive questions and avoids the calls for retrieval and to the LLM.
- Guardrails: The prompt might contain sensitive data, toxic content or biased information, and so might the completions. Guardrails are a way to avoid those, in addition to hallucinations.
- Conversation history: Context is key to both retrieval and generation. You will need a capability to store conversation history for the session to drive the accuracy of the results returned to the user.
- Formatter: You might run into requirements to format the response in a particular way: JSON, XML, code, etc.
- Moderation service: This is the ability to review the accuracy of the results, correct any inaccuracies and send the final result to the user. There are multiple ways to do this. More about this in future articles.
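The caching idea can be sketched in a few lines. The example below keeps responses in a plain Python dictionary as a stand-in for an in-memory database, and matches only exact repeats after normalizing case and whitespace; it is not a semantic cache. The class name, the normalization rule and the `generate` callback are assumptions for illustration.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """Exact-match response cache; a dict stands in for an in-memory database."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt: str, generate: Callable[[str], str]) -> str:
        """Return a cached answer, or call generate() once and cache its result."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = generate(prompt)  # retrieval + LLM call would happen in here
        self._store[key] = answer
        return answer


if __name__ == "__main__":
    cache = ResponseCache()

    def fake_rag_pipeline(prompt: str) -> str:
        # Hypothetical placeholder for the real retrieval and generation calls.
        return f"answer to: {prompt}"

    cache.get_or_generate("What is RAG?", fake_rag_pipeline)
    cache.get_or_generate("  what is   RAG? ", fake_rag_pipeline)
    print(cache.hits, cache.misses)
```

A real deployment would also need an expiry policy, and invalidation when the knowledge base is re-indexed, so that cached answers do not outlive the data they were built from.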
Let us talk about a few options to get this done
Amazon Bedrock is a Generative AI platform designed to address enterprise-wide democratization. Amazon Bedrock provides built-in capabilities and extensible capabilities. For more information on Amazon Bedrock refer to one of my earlier posts here
- Bedrock Knowledge base: supports multiple vector databases including OpenSearch, Pinecone, PostgreSQL and more…
- Bedrock Model Hub: Text, embeddings and multi-modal model repositories hosting models from Amazon, partners and open source.
Amazon OpenSearch is one of the options to set up the Indexing and Retrieval pipelines. Amazon OpenSearch in combination with models from Amazon Bedrock / Amazon SageMaker can help with the end-to-end workflow, with the level of control and customization we talked about earlier. We will get to a deeper discussion / blog on Amazon OpenSearch. In the meantime, here are some of the features of OpenSearch…
- Data Prepper: If you use OpenSearch as your knowledge base / vector database, OpenSearch provides data ingestion pipelines that enable you to build your end-to-end indexing pipeline with specific tags for source, sink and processors.
- OpenSearch supports lexical search as well as sparse and dense vectors. It comes in 2 deployment models, managed and serverless. It supports hybrid search, the use of models from Amazon Bedrock and Amazon SageMaker through its neural plugins, and compression and quantization.
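To show what an HNSW index on the faiss engine looks like in OpenSearch, here is a sketch that only builds the request bodies for creating a k-NN index and for querying it; it does not call a cluster. The structure follows the k-NN plugin's request format as I understand it, so verify it against the documentation for your OpenSearch version. The field name `embedding`, the dimension and the HNSW parameter values are illustrative assumptions.

```python
def build_knn_index_body(dimension: int, field_name: str = "embedding",
                         m: int = 16, ef_construction: int = 128) -> dict:
    """Build the body for creating an index with one HNSW/faiss vector field."""
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN search on this index
        "mappings": {
            "properties": {
                field_name: {
                    "type": "knn_vector",
                    "dimension": dimension,  # must match the embedding model output
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                },
                "text": {"type": "text"},       # chunk text, for lexical search
                "source": {"type": "keyword"},  # metadata usable as a filter
            }
        },
    }


def build_knn_query(vector: list[float], k: int = 5,
                    field_name: str = "embedding") -> dict:
    """Build the body for a k-NN search returning the k nearest chunks."""
    return {"size": k, "query": {"knn": {field_name: {"vector": vector, "k": k}}}}


if __name__ == "__main__":
    import json
    print(json.dumps(build_knn_index_body(dimension=1024), indent=2))
    print(json.dumps(build_knn_query([0.1, 0.2, 0.3], k=3), indent=2))
```

These dictionaries can be passed to whichever client you use to talk to the cluster. Keeping them in small builder functions makes the index type, engine and parameters easy to change from configuration.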
You can also use frameworks like LangChain, LlamaIndex or Haystack to codify these pipeline components. Another interesting framework I came across is unstructured. If you want to learn more about it, there is a deeplearning.ai class on unstructured.
For more information, or to implement a quick-fire POC using AWS services, you can refer to this git repo: https://github.com/thandavm/rag-implementations/tree/main/bedrock