Capria is investing in rapid prototyping of applied GenAI systems, both for our portfolio companies and for our internal use. One such use case is GUS (Get Us Smarter), an LLM Chatbot capable of answering questions from our proprietary internal document set. Below in Part 1 you’ll see an interesting problem we encountered and how we fixed it. In Part 2 of this post, I outline considerations for using a vector database with LLM processing vs. fine-tuning an open-source LLM like LLAMA or Alpaca.
Part 1: Problem with GUS Chatbot’s Retrieval Accuracy
Our Chatbot comprises two layers:
1. Semantic Similarity Search: This layer scans the vector database (VDB) for relevant data based on the query’s semantic meaning and sends back chunks of information from various documents.
2. LLM (Language Model): This processes the data from the first layer to formulate an answer.
The size of each chunk we put in the VDB is fixed and is defined at index time.
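As a minimal sketch of what fixed-size chunking looks like (the function name and sizes here are illustrative, not taken from our codebase):

```python
def chunk_words(text, chunk_size=500):
    """Split text into consecutive chunks of at most `chunk_size` words.

    The chunk size is fixed at index time; changing it later means
    re-chunking and re-indexing the whole document set.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

For example, a 1,200-word document chunked at 500 words yields three chunks: two full chunks of 500 words and a final chunk of 200.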
When questions about our “mailing address” were asked, GUS would often return an unrelated response, mentioning it couldn’t find the address. On further investigation, the context supplied to the LLM was often filled with multiple mentions of the term “email address” and no actual mention of the “mailing address.”
Semantic searches understand questions at a deeper level than just string matches. Thus, it’s feasible for “mailing address” to have semantic closeness to “email address”. The challenge arises when multiple occurrences of semantically close but incorrect terms (like “email address”) overshadow the actual required term (“mailing address”).
Tesla vs. Prius Analogy
Imagine preferring a Tesla over any other car. However, if offered one Tesla or ten Toyota Priuses, many would opt for the bulk offer due to perceived value, even if it’s not a top choice. Similarly, multiple mentions of a term that’s semantically close can overshadow the actual required term in the bot’s retrieval mechanism.
Reducing the chunk size can decrease the chances of having multiple occurrences of similar but wrong terms. By decreasing our chunk size from 2000 words to 500 words, we achieved notably better retrieval accuracy. This issue showcases the intricacies of building effective Chatbots. The Tesla vs. Prius analogy underscores one of many challenges.
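To make the effect concrete, here is a toy illustration. The 2-D “embeddings” below are chosen by hand for the example, not produced by any real model, and the chunk embedding is a simple mean of word vectors (a common pooling scheme). It shows several near-miss mentions outscoring a single exact mention in large chunks, and the ranking flipping with smaller chunks, assuming the near-miss mentions are spread across the document so a small chunk contains only one of them:

```python
from math import sqrt

# Hand-picked toy vectors (hypothetical, for illustration only):
MAILING = (1.0, 0.0)   # exact match for the query "mailing address"
EMAIL   = (0.8, 0.6)   # semantically close, but the wrong term
FILLER  = (0.0, 1.0)   # unrelated text, orthogonal to the query
QUERY   = (1.0, 0.0)

def chunk_embedding(words):
    """Mean of the word vectors in a chunk."""
    n = len(words)
    return (sum(w[0] for w in words) / n, sum(w[1] for w in words) / n)

def cosine(a, b):
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (sqrt(a[0] ** 2 + a[1] ** 2) * sqrt(b[0] ** 2 + b[1] ** 2))

# Large chunks: one "mailing address" mention is diluted by 19 filler
# words, while six "email address" mentions dominate their chunk.
big_mailing = cosine(QUERY, chunk_embedding([MAILING] + [FILLER] * 19))
big_email   = cosine(QUERY, chunk_embedding([EMAIL] * 6 + [FILLER] * 14))

# Smaller chunks: the single mailing mention is less diluted, and the
# email mentions, spread across the document, land one per chunk.
small_mailing = cosine(QUERY, chunk_embedding([MAILING] + [FILLER] * 4))
small_email   = cosine(QUERY, chunk_embedding([EMAIL] + [FILLER] * 4))

print(big_email > big_mailing)      # True: wrong chunk ranks first
print(small_mailing > small_email)  # True: right chunk ranks first
```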
Part 2: Vector DB + GPT4 vs. Fine-Tuning
Recently I was consulting with one of our very senior (and pragmatic) AI advisors about our experiences with our internally developed GUS GenAI tool. As you saw in Part 1, we are experimenting and just now rolling GUS out for internal testing. We will have much more to say about the entire effort soon, but I wanted to share some more real-time learning. For context, GUS is what I’d call an LLM-enhanced intranet search: we have a discrete set of docs “chunked” and stored in our VDB (Pinecone), and we use the LLM (GPT4 at the moment) to process and present the results we get from the VDB search.
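In rough outline, the pipeline looks like the sketch below. The embedding, VDB query, and LLM call are toy stand-ins for the real Pinecone and GPT4 API calls, and none of the function names are from our actual code:

```python
def embed(text):
    # Stand-in for a real embedding model: just counts two keywords.
    return [text.count("mailing"), text.count("email")]

def vdb_search(query_vec, index, top_k=2):
    # Stand-in for a Pinecone query: rank stored chunks by dot product
    # with the query vector and return the top_k chunk texts.
    scored = sorted(
        index,
        key=lambda item: -sum(q * v for q, v in zip(query_vec, item["vector"])),
    )
    return [item["text"] for item in scored[:top_k]]

def llm_answer(question, context_chunks):
    # Stand-in for a GPT4 call: the real system sends the retrieved
    # chunks plus the question as a single prompt and returns the
    # model's completion; here we just return the assembled prompt.
    return "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {question}"

# Index time: embed each chunk and store vector + text.
index = [
    {"text": t, "vector": embed(t)}
    for t in ["Our mailing address is ...", "Contact us by email address ..."]
]

# Query time: embed the question, search the VDB, hand chunks to the LLM.
chunks = vdb_search(embed("what is our mailing address"), index)
answer = llm_answer("What is our mailing address?", chunks)
```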
When is the Right Time to Embrace Fine-Tuning?
To cut to the chase: I was discussing with our advisor the pros/cons of this architecture vs. fine-tuning LLAMA (or Alpaca). He made a strong point that rather than fine-tuning a lesser LLM, we would be much better off focusing on the VDB + GPT4 architecture, making use of the large context window of GPT4, and ensuring we are exploring good prompt engineering. His core point was that with appropriate chunking of the data, and even pre-processing before vectorization, the semantic understanding of GPT4 is so far superior to LLAMA’s that we should get far better results with this architecture. That said, there are limits to what we can fit into even a 32k context window, and our next phase of development will be to index a much larger set of docs, from which we want to drive not just recall but comparison and analysis through the LLM. So we will be trying both approaches in parallel. We will share more as we progress and learn.
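One practical piece of making use of a large context window is deciding how many retrieved chunks actually fit. A hypothetical sketch of a greedy packer, using a rough word-based token estimate rather than the model’s real tokenizer (and ignoring the room a real prompt must reserve for the question and the answer):

```python
def pack_context(chunks, budget_tokens=32_000, est_tokens_per_word=1.3):
    """Greedily pack retrieved chunks, best-ranked first, into a token
    budget. Token costs here are crude word-count estimates; a real
    system would count tokens with the model's actual tokenizer and
    reserve headroom for the question and the generated answer."""
    packed, used = [], 0
    for chunk in chunks:
        cost = int(len(chunk.split()) * est_tokens_per_word)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed

# With 500-word chunks (~650 estimated tokens each), a 32k window
# holds 49 chunks under this estimate.
packed = pack_context([("word " * 500).strip() for _ in range(60)])
```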
Enhancements in Document QA through Chunk Summarization
In the same advisor conversation last week, I was introduced to a pivotal concept concerning document Q&A. At present, our approach is to segment documents into fixed-sized chunks. As highlighted in Part 1, retrieval accuracy can sometimes be compromised by interference from semantically similar terms. It’s crucial to note that this isn’t a constant issue, but it remains a challenge we aim to address.
To tackle this, the proposition is to distill (aka pre-process) each chunk to encapsulate its primary essence. For illustration:
– Chunks detailing the mailing address would be refined to: “contains the mailing address”.
– Those with information on the email address would read as: “contains information regarding email address”.
This can be done with the GPT4 API, drawing on GPT’s formidable summarization and distillation capabilities. This method would effectively let us gauge the relevance of a chunk from its embedded succinct summary. Admittedly, the initial phase of this procedure demands substantial resources, since each chunk needs pre-processing. Nevertheless, we believe that in the long term this will dramatically bolster the Chatbot’s precision.
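Sketched out, the indexing step could look like the following. The rule-based distiller below is a toy stand-in for the GPT4 summarization call, and the record layout is hypothetical:

```python
def distill(chunk):
    # Toy stand-in for GPT4 summarization: reduce a chunk to its
    # primary essence, as in the examples above.
    if "mailing address" in chunk:
        return "contains the mailing address"
    if "email address" in chunk:
        return "contains information regarding email address"
    return "general content"

def index_chunk(chunk):
    """Store the raw chunk alongside its distilled summary: relevance
    is judged against the summary, while the raw chunk is what gets
    passed to the LLM for answering."""
    return {"text": chunk, "summary": distill(chunk)}

records = [
    index_chunk("Our mailing address is 123 Example St."),  # hypothetical text
    index_chunk("Reach us at our email address anytime."),
]
```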
Important Note: This approach is an addition; it is not intended to supplant the conventional method of determining relevance through comprehensive chunk analysis.
This feature will soon be incorporated into GUS, our testbed for GenAI. Stay tuned for more insights.