Philip Piletic December 10, 2024

Collected at: https://datafloq.com/read/why-rag-vectorization-future-data-analytics/

Data analytics is moving at the speed of light. Are you keeping up? Machine learning-powered analytics enabled faster analysis of larger datasets, and then came generative AI (GenAI), bringing with it self-service business intelligence (SSBI) and “GenBI” chatbots that let you ask questions in natural language.

And still, data analytics keeps racing ahead. People want tools that are faster, more accurate, and even easier to use for line-of-business users without data science backgrounds. What’s more, they now want insights that are personalized and relevant to specific use cases, business concerns, and their own organization’s circumstances.

All of this is ushering us towards what many experts are hailing as the next big upgrade in the use of LLMs for analytics: Retrieval-Augmented Generation, or RAG, which uses data retrieved from external sources to improve content generation, along with vectorization, which converts complex data into numerical vectors for precise retrieval and merging with proprietary sources.

“RAG is the business of taking proprietary corporate datasets, and being able to query them and use them as part of the LLM flow, without taking that information and putting it into the LLM itself,” explains Avi Perez, CTO of Pyramid Analytics, in a recent interview.

“Vectorization, in the broader sense, is the grand gluing together of public, open source information that the LLM has been trained on and knows about,” he continues, “with something specific about my business which isn’t public.” 

What is it, though, that makes RAG and vectorization so pivotal to the future of AI-enabled analytics?

The Challenge Today

When it comes to SSBI, user expectations are high and patience is thin. People want to use the magic of GenAI to gain fast answers about their own business in the context of the wider business environment, and they want it all delivered with the same accuracy as if they were asking for directions to the nearest Starbucks.

This requires working with models that have full knowledge of your proprietary business and/or sensitive customer data, as well as extensive data from other, wider sources like the greater industry, regional market, and global economy. 

But there are obstacles to achieving this goal. Most companies have enormous databases that would slow LLMs down too much, and the data isn’t usually organized for quick retrieval. Security and privacy concerns often discourage businesses from connecting their data directly to LLMs. As a result, the models don’t always have the data needed to answer your queries, or the conversational continuity to learn from your queries and improve the relevance of their responses over time. 

How RAG Helps

RAG combines retrieval with generative capabilities, so it can find the right data among available sources and produce relevant answers. In theory at least, RAG actualizes the aforementioned magic of GenAI for business contexts by bringing full knowledge about proprietary business and/or sensitive customer data together with data from other, wider sources. 
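To make this concrete, here is a minimal sketch of the retrieve-then-generate flow. The bag-of-words embed() and the prompt-building generate() below are toy stand-ins for a real embedding model and LLM call, not any specific vendor’s API:

from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses a trained embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # The "retrieval" half of RAG: rank available documents by similarity to the query.
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def generate(question: str, context: list[str]) -> str:
    # The "generation" half: retrieved snippets are prepended to the prompt,
    # so the model answers from data it was never trained on.
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
    return prompt  # a real system would send this prompt to an LLM

corpus = [
    "Q3 revenue in the EMEA region grew 12% year over year.",
    "The employee handbook covers the remote work policy.",
    "Churn in the SMB segment rose to 6% in Q3.",
]
print(generate("How did EMEA revenue change in Q3?",
               retrieve("EMEA revenue Q3", corpus)))

Note that the proprietary records never enter the model itself; they travel only in the prompt, which is exactly the separation Perez describes above.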

“The key idea behind RAG is that having access and conditioning on relevant background knowledge can significantly improve generative model performance on downstream NLP tasks,” says Bijit Ghosh, CTO and Global Head of Cloud Product and Engineering & AI/ML at Deutsche Bank.

“The RAG paradigm allows models to have impressive retrieval abilities for gathering relevant information, combined with excellent natural language generation capabilities for producing fluent, human-like text. This hybrid approach leads to state-of-the-art results on tasks ranging from open-domain question answering to dialog systems.”

RAG certainly succeeds in enabling more accurate content and less hallucination. It speeds up the indexing, retrieval, and decoding of data; it’s applicable to many domains, including enterprise search, contextual recommendations, and open data exploration; and it scales to large repositories, within limits.

But it’s not a silver bullet, and the following challenges to retrieving the right data remain:

  • There’s simply too much data available to process for authoritative answers
  • The data is often ambiguous, generally because it hasn’t been carefully labeled and indexed, and the retrieval engine struggles with ambiguity 
  • The retrieval engine struggles to understand complex queries
  • Most RAG approaches need large human-labeled training datasets, which are scarce

Which is why vectorization needs to be part of the picture as well. 

The Promise of Vectorization

Vectorization turns text-based data points into numerical vectors, sequences of numbers that represent the data’s content. This allows for more precise searches than when data is stored as raw blocks of text. Often, when people talk about fine-tuning an LLM, they’re actually describing the need for vectorization.
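As a quick illustration, here is a sketch using the open-source sentence-transformers library; the model name is one common choice, not a recommendation, and any embedding model would do:

from sentence_transformers import SentenceTransformer, util

# A small, general-purpose embedding model (one common open-source choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Quarterly revenue increased in the European market.",
    "Sales went up in Europe last quarter.",
    "The office coffee machine is broken.",
]

vectors = model.encode(sentences)  # each sentence becomes a vector of 384 floats
print(vectors.shape)               # (3, 384)

# Semantically similar sentences land close together in vector space,
# even when they share almost no keywords.
print(util.cos_sim(vectors[0], vectors[1]))  # high similarity
print(util.cos_sim(vectors[0], vectors[2]))  # low similarity

Because similarity is computed on the numbers rather than on exact words, the first two sentences match even though they share almost no vocabulary.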

“Imagine each piece of information in the knowledge base as a unique fingerprint. Vector embeddings are mathematical representations that capture the essence of this information, like a digital fingerprint. The retrieval system uses vector search to find information in the knowledge base with fingerprints most similar to the user’s query,” explains Priyanka Vergadia, who leads developer engagement strategy for Microsoft Azure.

“These components allow RAG to leverage the strengths of LLMs while incorporating specific knowledge from external sources. This leads to more informative, accurate, and trustworthy outputs for tasks like question answering, chatbot interactions, and more.” 

In brief, vectorization helps make RAG work in the real world. The process reduces ambiguity in datasets, helps RAG find the right data in large, disorganized, unlabeled data lakes and warehouses, and generally makes it easier for the engine to scan data and find relevant data points. As a result, AI can produce data insights with less hallucination, faster searches, and higher scalability, even when massive datasets are involved.

However, vectorization is neither cheap nor easy. You still need a robust semantic layer to make vectorization work, which means experimenting with RAG pipelines that use different embedding models, chunking strategies, and retrieval settings.
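Chunking is one of those knobs. Here is a minimal sketch of a fixed-size, overlapping chunking strategy; the size and overlap values are arbitrary examples, and the right settings depend on your documents and your embedding model:

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Split text into overlapping windows of `size` words. The overlap keeps
    # a sentence that straddles a boundary visible to at least one chunk.
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

print(len(chunk("lorem " * 500)))  # 500 words -> 3 overlapping chunks

Each chunk is embedded and indexed separately, so these boundary choices directly shape what the retriever can and cannot find.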

The Solution Has Yet To Be Actualized

It all sounds great, but many parts of the tech equation still aren’t clear. Productionized capabilities in the space, especially around RAG for documents, don’t solve issues of governance, data security, or performance on a vast scale, and even with vectorization, there can be too much data.

Many business analysts rely on truly “big data” resources. Loading these into an LLM would make it far too slow; nobody is willing to wait three minutes for an answer to their query. It’s also impossible to load it all into a vector database and still get the performance and scale you want.

Ultimately, tech can’t solve the data issue just yet. At the moment, you still need a human in the loop to remove the noise and pare back the data to leave only that which is relevant. It has to be someone who understands the business problem, the use case details, and how to formulate a data query that addresses the business problem. This is serious heavy lifting that can’t yet be delegated to a tech solution, although AI software companies are working hard to close these gaps. 
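To sketch what that human-guided paring might look like, here is a hypothetical pre-filtering step; the field names and the filter condition are invented for illustration:

# A large dataset, represented here as a handful of records.
records = [
    {"region": "EMEA", "quarter": "Q3", "text": "EMEA revenue grew 12%."},
    {"region": "APAC", "quarter": "Q3", "text": "APAC churn rose to 6%."},
    {"region": "EMEA", "quarter": "Q2", "text": "EMEA revenue was flat."},
]

# Someone who understands the business problem decides what "relevant" means
# for this use case, long before anything is chunked, embedded, or indexed.
relevant = [r for r in records if r["region"] == "EMEA" and r["quarter"] == "Q3"]

print([r["text"] for r in relevant])  # only the pared-back slice moves on

The judgment encoded in that filter, knowing which fields and values matter for the question at hand, is exactly the part that can’t yet be delegated to software.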

Additionally, RAG and vectorization involve many variables: embedding models, chunking strategies, retrieval settings. You need a semantic layer to optimize the way they work, and you have to test RAG pipelines across different combinations of these settings to find the one that’s most useful for the use case at hand.
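One way to manage those variables is a systematic sweep. In the sketch below, evaluate() is a placeholder for whatever quality metric you trust, such as answer accuracy on a hand-labeled question set, and all of the setting values are invented examples:

from itertools import product

embedding_models = ["small-fast-model", "large-accurate-model"]  # hypothetical names
chunk_sizes = [128, 256, 512]                                    # words per chunk
top_ks = [3, 5, 10]                                              # chunks retrieved per query

def evaluate(model: str, chunk_size: int, top_k: int) -> float:
    # Placeholder: build the pipeline with these settings, run it against a
    # hand-labeled question set, and return an answer-quality score.
    return 0.0  # replace with a real metric

best = max(
    product(embedding_models, chunk_sizes, top_ks),
    key=lambda cfg: evaluate(*cfg),
)
print("best pipeline settings:", best)

Even this toy grid has 2 x 3 x 3 = 18 combinations, which is one reason tuning a RAG pipeline is real engineering work rather than a checkbox.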

On the Brink of a New Era

The challenges that currently confront data analytics are significant, and RAG and vectorization don’t sweep them all away. However, it’s still early days, and they do represent the best path to a solution. It’s similar to where we were two years ago, when anyone who wanted to build a reporting or analytics solution on a data warehouse had to cut back the noise to make something functional and usable. The analytics ecosystem has resolved that challenge, and we’ll resolve this one too.
