Architecture for Multi-Source Data Analysis with Generative AI

The classic post-modern problem in organizations: heaps of data that generate no insights.

People from all specialties, from analysts to CEOs, struggle to be creative in their decisions while still being supported by facts and data. Dashboards that do not generate insights, reverse engineering to arrive at predetermined conclusions, ETL (Extraction, Transformation, Loading) routines that overload critical systems: there are problems of all kinds when it comes to data analysis. Indicators and metrics serve only for control and monitoring purposes, nothing more. The information is sparse yet voluminous, leaving little room for creative action. Well, Generative AI can help us untangle this web of informational excess! A possibility opens up here, because part of this technology's capability is the ability to "reason" about information, presenting unbiased conclusions for our creative and broader consumption and reducing cognitive load. My proposal with this analysis is precisely to present an architecture that helps us reach conclusions, with semantic machines placed in this much-needed intermediary position, summarizing vast amounts of data and eliminating excess.

Some questions need to be answered when proposing an architecture for real-time analysis of data that comes from various sources. One of them is how a Large Language Model (LLM) understands and absorbs structured data such as spreadsheets, charts, and tables, given that its focus is on plain text. Spoiler: there are intermediate products that help us with this. Other questions are how to organize ETL routines for Generative AI and how they differ from the current architecture. I believe the most relevant question is how to combine data of different natures and dimensions so that it leads to a place where real insights are generated. And oh, will we consume it traditionally through dashboards, or will we have conversations with the model? It seems like it will be conversations! Well, let's explore each of these points one by one:

Generative AI and the incorporation of structured (non-textual) data

As many know, LLMs are trained with vast amounts of data, including text, and can also be enhanced with proprietary data. But how do you proceed in the case of a model that must incorporate information with different structures? How do you feed a natural language model with the dashboards people consume and with raw data straight from SQL databases, for example? The answer is that we will likely need different techniques for each type of inclusion. In the case of spreadsheets and other "relational" data, one of the tactics will be the so-called "Knowledge Graphs," and in this field there are tools emerging at the forefront, such as Neo4j, a graph database with an open-source edition that recently gained an integration with LangChain. What a knowledge graph does is make explicit the relationships between pieces of information that are spread across N relational databases with different structures. Not by chance, this is precisely the problem that many corporations face: reconciling vast inferences scattered across X zillions of tables and dashboards.
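To make the idea of a knowledge graph a little more concrete, here is a minimal sketch in plain Python, with no graph database involved. It turns rows from two unrelated tables into (subject, predicate, object) triples and indexes them by entity. The table names, columns, and values are all hypothetical, and a real setup would store the triples in something like Neo4j rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical rows, as they might come out of two unrelated tables.
crm_rows = [
    {"company": "Enterprise X", "segment": "Retail", "account_manager": "Ana"},
]
sales_rows = [
    {"company": "Enterprise X", "product": "Widget", "units": 120},
]


def rows_to_triples(rows, key, table):
    """Turn each non-key column into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"{table}.{column}", str(value)))
    return triples


def build_graph(triples):
    """Index triples by subject so every fact about an entity sits together."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


triples = (
    rows_to_triples(crm_rows, "company", "crm")
    + rows_to_triples(sales_rows, "company", "sales")
)
graph = build_graph(triples)

# Facts from both tables now hang off the same node.
for predicate, obj in graph["Enterprise X"]:
    print(f"Enterprise X --{predicate}--> {obj}")
```

The point of the sketch is only the shape of the data: once both tables are expressed as triples, "Enterprise X" becomes a single node carrying facts from the CRM and from sales.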

In other words, Neo4j and/or generic knowledge graph managers let you "plug in" your different sources of knowledge, like spreadsheets, dashboards, SQL extractions, and so on, and compile them so they can be incorporated into any LLM using the LangChain framework. Just like the Vector Databases extensively explained in this blog, Knowledge Graphs are gaining traction precisely because they are not limited to plain text and can compile data from different types of sources. There are also other solutions outside the open-source context, such as Vertex AI and Microsoft Graph, which are viable options for large companies, despite their limitations. These are possible solutions for companies looking to create a truly intelligent data hub. :)
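As a rough illustration of "plugging in" different sources, the sketch below reads a spreadsheet export (CSV text) and a SQL table (an in-memory SQLite database), merges both into one list of graph edges, and serializes the result as plain sentences that could travel inside an LLM prompt. It uses only the Python standard library; the data, the edge names `operates_in` and `total_sales`, and the helper functions are assumptions made for the example, not the API of Neo4j or LangChain.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; company names and segments are illustrative.
SPREADSHEET = "company,segment\nEnterprise X,Retail\nEnterprise Y,Logistics\n"


def edges_from_spreadsheet(text):
    """Read CSV text and emit one (subject, relation, object) edge per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["company"], "operates_in", row["segment"]) for row in reader]


def edges_from_sql(conn):
    """Aggregate a sales table and emit one edge per company."""
    query = (
        "SELECT company, SUM(amount) FROM sales "
        "GROUP BY company ORDER BY company"
    )
    return [
        (company, "total_sales", f"{total:.2f}")
        for company, total in conn.execute(query)
    ]


def graph_to_context(edges):
    """Serialize edges as plain sentences that an LLM prompt can carry."""
    return "\n".join(f"{s} {p.replace('_', ' ')} {o}." for s, p, o in edges)


# Hypothetical SQL source, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Enterprise X", 1200.0), ("Enterprise X", 300.5), ("Enterprise Y", 80.0)],
)

edges = edges_from_spreadsheet(SPREADSHEET) + edges_from_sql(conn)
context = graph_to_context(edges)
print(context)
```

Each source keeps its own small adapter, and everything downstream only sees edges, which is what lets a new source be added without touching the rest.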

Extraction, Transformation, and Loading (ETL) for Generative AI

Similar to current ETL routines, new products for strategic data consumption with Generative AI will also need transformation processes, and these can be costly! However, there are tangible and advantageous differences here, which I'll explain. The first is that you can eliminate "band-aid" solutions and redundancies and build knowledge graphs directly from the raw core systems such as CRMs, sales systems, etc., eliminating any intermediate microservices and replicas that exist solely for generating reports. This alone results in significant processing savings. Another advantage is that these frameworks can discover the inferential relationships between entities on their own; you don't need to teach them that "enterprise X" in table "Y" is the same "enterprise X" in the sales table that had "Z" sales, because the model itself makes these inferences. This will save a lot of time for data scientists, who are typically employed to model these relationships in traditional machine learning models. The money spent on SaaS for graph training will be offset by freeing data analysts from their repetitive work. Still in denial that Generative AI will transform our entire IT architecture? It's time to think outside the box!
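To show what "knowing that enterprise X in one table is the same enterprise X in another" means in practice, here is a deliberately simple sketch. It is a rule-based stand-in, not the model-driven inference described above: it normalizes names (lowercase, no punctuation, no legal suffixes) and groups mentions that end up with the same key. The suffix list, table names, and company names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumption: a tiny, non-exhaustive list of legal suffixes to ignore.
LEGAL_SUFFIXES = {"inc", "ltd", "llc", "sa", "corp"}


def canonical_name(raw):
    """Lowercase, drop punctuation and legal suffixes to get a comparison key."""
    words = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def link_entities(tables):
    """Group (table, raw_name) mentions that share the same canonical key."""
    linked = defaultdict(list)
    for table, names in tables.items():
        for name in names:
            linked[canonical_name(name)].append((table, name))
    return dict(linked)


# Hypothetical mentions of the same companies in two core systems.
tables = {
    "crm": ["Enterprise X Ltd.", "Acme Corp"],
    "sales": ["ENTERPRISE X", "acme corp."],
}
linked = link_entities(tables)

for key, mentions in linked.items():
    print(key, "->", mentions)
```

Hand-written rules like these are exactly the repetitive modeling work the paragraph above expects to shrink; the sketch is here to make visible what is being automated.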

Data Consumption: Conversations over Structured Dashboards

Finally, here we have possibly the most significant paradigm shift that we will likely experience with Generative AI as a whole. We will consume less pre-compiled information in dashboards and instead start asking questions to language models that incorporate real-time transactional data. Rather than opening a report filled with charts and painstakingly compiled textual explanations, as we have done for decades in a company's daily operations, presentations, and research and discovery, we will converse with the model, which will summarize everything and, through our questions, perform inferences enhanced by insights generated by the model itself! Are we prepared for this? I believe not, but we are on the path; it will require a cultural shift from expecting past results to envisioning future scenario possibilities. It's about knowing how to ask the right questions! In a corporate monologue-driven world, we need to go back to something we've left behind: conversation! 🍻
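A conversation with the model over company data can be sketched, under heavy simplification, as three steps: pick the facts that relate to the question, put them in a prompt, and hand the prompt to a model. In the example below the facts are hypothetical, retrieval is a naive substring match on entity names, and `echo_llm` is a placeholder that only reports how many facts it received; a real model client would take its place.

```python
# Hypothetical facts, as they might be read from a knowledge graph.
FACTS = [
    ("Enterprise X", "total sales last quarter", "1500 units"),
    ("Enterprise X", "segment", "Retail"),
    ("Enterprise Y", "total sales last quarter", "80 units"),
]


def retrieve(question, facts):
    """Keep only the facts whose subject is mentioned in the question."""
    q = question.lower()
    return [fact for fact in facts if fact[0].lower() in q]


def build_prompt(question, facts):
    """Put the selected facts and the question into a single prompt string."""
    lines = [f"- {s}: {p} = {o}" for s, p, o in facts]
    return (
        "Answer using only these facts:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


def ask(question, facts, llm):
    """`llm` is any callable str -> str; a real model client would go here."""
    return llm(build_prompt(question, retrieve(question, facts)))


def echo_llm(prompt):
    """Placeholder for a model call: reports how many facts it was given."""
    n_facts = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"(model would answer from {n_facts} facts)"


question = "How did Enterprise X perform last quarter?"
answer = ask(question, FACTS, echo_llm)
print(answer)
```

Keeping the model behind a plain callable means the same question flow works whichever model ends up answering, and the quality of the answer depends on asking a question that pulls in the right facts.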

PS: These articles are not written by ChatGPT. Language models aren't creative, but they are powerful allies in reducing cognitive effort. 😎
