ENHANCING TABULAR DATA ANALYSIS WITH LLMS

Wenxin Song · Published in GoPenAI · 12 min read · Feb 5, 2024

1. INTRODUCTION

In the rapidly evolving landscape of data processing and analysis, Large Language Models (LLMs) stand at the forefront, offering groundbreaking capabilities that extend beyond traditional text-based applications. A particularly intriguing and less explored domain is the use of LLMs to interpret and reason over tabular data. This blog delves into the intricacies of leveraging LLMs to query tabular data, a niche yet immensely potent application that promises to transform how we interact with structured datasets.

At the heart of our exploration are two innovative technologies: LlamaIndex and LocalAI. LlamaIndex, embodying the principles outlined in the recent papers "Rethinking Tabular Data Understanding with Large Language Models" and "Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding", serves as a pivotal tool in our quest: it implements strategies from these papers, making the theoretical practical. Complementing LlamaIndex, LocalAI offers a seamless environment for launching and interacting with LLMs locally. This synergy not only democratizes access to advanced data-querying methods but also propels the usability of LLMs to new heights.

This blog aims to demystify and demonstrate querying tabular data in natural language. We will walk through launching local LLMs from LocalAI's Docker images and interfacing with them through an OpenAI-compatible API service. The journey is not just theoretical: we provide a hands-on demonstration, setting up the entire system in Docker containers, adaptable to both CPU-only and GPU-equipped machines.

2. THEORETICAL BACKGROUND

Tabular data, with its structured yet complex nature, presents unique challenges and opportunities for LLMs. In this section, we review the theoretical underpinnings from two papers, which lay the foundation for our practical exploration.

"RETHINKING TABULAR DATA UNDERSTANDING WITH LARGE LANGUAGE MODELS"

This paper serves as a cornerstone for understanding the capabilities and limitations of LLMs in interpreting tabular data. It examines three core perspectives:

1. Robustness to structural perturbations: The study reveals that LLMs exhibit a notable performance decline when faced with structural variations in tables. This insight highlights the need for robust models that maintain accuracy despite changes in table format.
2. Textual vs. symbolic reasoning: A comparative analysis between textual and symbolic reasoning shows a slight edge for textual reasoning on tabular data. However, the strengths of each approach vary by task, suggesting a nuanced application of the two reasoning methods.
3. Boosting performance through reasoning pathways: Perhaps the most significant contribution is the proposal to aggregate multiple reasoning pathways. By integrating textual and symbolic reasoning and employing a mix self-consistency mechanism, the model achieves state-of-the-art performance.
This approach not only enhances accuracy but also paves the way for more sophisticated table-processing paradigms in LLMs.

"CHAIN-OF-TABLE: EVOLVING TABLES IN THE REASONING CHAIN FOR TABLE UNDERSTANDING"

This paper introduces the innovative Chain-of-Table framework, revolutionizing how LLMs interact with tabular data:

1. Tabular data in the reasoning chain: The paper proposes incorporating tabular data explicitly into the reasoning chain. This contrasts with traditional approaches that rely primarily on textual context, offering a novel way to leverage the structured nature of tables.
2. Iterative table evolution: The framework guides LLMs to iteratively generate operations and update the table, effectively creating a "tabular reasoning chain." This dynamic evolution lets the model plan subsequent operations based on previous results, mirroring a more human-like reasoning process.
3. Structured intermediate results: An intriguing aspect of this approach is that the evolving table carries structured information about intermediate results. This not only makes the reasoning process transparent but also enhances the reliability and accuracy of predictions.

BRIDGING THEORY AND PRACTICE

Together, these papers form a robust theoretical framework for applying LLMs to querying tabular data. They shed light on the nuances of tabular data processing, emphasizing the need for models that are adaptable, capable of complex reasoning, and sensitive to the structured format of tables. The practical demonstrations that follow, built with LlamaIndex and LocalAI, rest on these insights.

3. TECHNOLOGY BACKGROUND

LLAMAINDEX AND ITS LLAMA PACKS

LlamaIndex is a versatile "data framework" for building LLM applications. It simplifies ingesting data from a variety of sources and formats, including APIs, PDFs, documents, and SQL. The framework structures data using indices and graphs, ensuring seamless compatibility with LLMs. One of its key features is an advanced retrieval and query interface, which allows users to submit LLM prompts and receive contextually enriched responses.

Llama Packs complements LlamaIndex as a community-driven hub offering prepackaged modules that jumpstart LLM application development. These modules cover diverse applications, from creating Streamlit apps to advanced retrieval and structured data extraction from resumes. A key feature of Llama Packs is its flexibility: users can import modules for immediate use, but can also inspect and customize them to meet specific needs and preferences.

LOCALAI FRAMEWORK

LocalAI is a free, open-source alternative to OpenAI for those seeking local inferencing capabilities. It functions as a drop-in replacement REST API, fully compatible with OpenAI's API specifications. Designed to run Large Language Models (LLMs), generate images, and produce audio, among other functionalities, LocalAI is versatile in its application. Notably, it supports multiple model families and architectures, and it operates effectively without the need for a GPU.

4. PRACTICAL DEMONSTRATION

DEMO ENVIRONMENT

For guidance on configuring AWS EC2 CPU or GPU instances to run local LLMs in Docker containers, refer to this blog.
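Before walking through the demo, here is what "drop-in replacement" means in practice: any OpenAI-style client can talk to a LocalAI endpoint simply by overriding the base URL. The sketch below uses the official openai Python client; the host, port, model name, and placeholder API key are assumptions for illustration and depend on how your LocalAI instance is configured.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local, OpenAI-compatible endpoint
# instead of api.openai.com. Host, port, and model name are illustrative.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-7b-instruct",  # whatever model your LocalAI server exposes
    messages=[{"role": "user", "content": "Summarize what a pivot table does."}],
)
print(response.choices[0].message.content)
```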
RUNNING ON AN AWS CPU INSTANCE

To launch the demo on an AWS CPU instance, follow this concise step-by-step guide:

1. Clone the repo to your EC2 CPU instance:

```
git clone https://github.com/LinkTime-Corp/llm-in-containers.git
cd llm-in-containers/tabular-data-analysis
```

2. Insert your OpenAI API Key into conf/config.json as "OPENAI_API_KEY". You can skip this step if you don't want to evaluate against the OpenAI backend.

3. Download the local models. If the wget command gives you trouble, you can manually download the model from this link and save it in the 'models' directory.

```
bash download-models.sh
```

4. Launch the demo:

```
bash run.sh
```

5. Visit the UI at http://{IP of EC2 CPU instance}:8501.

6. Shut down the demo:

```
bash shutdown.sh
```

OPEN THE WEB UI

Now let's play around with the UI by uploading some sample data:

1. Visit the provided link to download a set of sample CSV files, and unzip the downloaded file. For the demo, we will use the file located at "WikiTableQuestions/csv/200-csv/11.csv".
2. On the UI, start by uploading a CSV file, such as the one you just unzipped. Select the "LLM Type" for processing your query: you can choose between "ChatGPT" and "Local_LLM".
3. Choose the engine for querying your tabular data file. There are two options: "MixSelfConsistency" and "ChainOfTable".
4. With these selections made, you can now ask questions about the data in your CSV file, for example, "Who won the Best Actor award in 1972?". Click the "Query" button to submit your question and receive the answer from the chosen LLM.
5. On the sidebar of the UI, there is an option to view the LLM inference trace. This lets you see the step-by-step processing of your question by the LLM, providing insight into how the answer was derived.

[Figure: Web UI for querying CSV files]

RUNNING ON AN AWS GPU INSTANCE

Launching the demo on an AWS GPU instance is similar to the CPU setup, except that the 'run.sh' and 'shutdown.sh' scripts take a "-gpu" flag:

1. Clone the repo to your EC2 GPU instance:

```
git clone https://github.com/LinkTime-Corp/llm-in-containers.git
cd llm-in-containers/tabular-data-analysis
```

2. Insert your OpenAI API Key into conf/config.json as "OPENAI_API_KEY".

3. Launch the demo:

```
bash run.sh -gpu
```

4. Visit the UI at http://{IP of EC2 GPU instance}:8501.

5. Shut down the demo:

```
bash shutdown.sh -gpu
```

5. CODE STRUCTURE

The code structure of the demo is designed to be simple and intuitive, with different components organized into separate files for clarity and ease of maintenance:

* main.py: contains the code for the user interface (UI).
* backend.py: handles the logic of choosing LLMs and query engines, and of interacting with the LLMs.
* constants.py: defines all the constants used throughout the codebase.

6. UNRAVELING THE MECHANICS OF QUERY ENGINES

In the demonstration, we used the CSV file "11.csv" from the "WikiTableQuestions/csv/200-csv" sample dataset. It contains the structured tabular data shown below. Let's explore how the query engines respond to the question, "Which nominee won the best actor for the Academy Awards in 1972?".

[Table image: the CSV file used in the demo]

6.1 "MIXSELFCONSISTENCY" QUERY ENGINE

The "MixSelfConsistency" engine cycles through two distinct types of query paths, each run for a configurable number of iterations. The two paths are called "Textual Reasoning" and "Symbolic Reasoning".
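Conceptually, the engine's control flow can be sketched in a few lines of Python. This is a simplified illustration rather than the Llama Pack's actual code: the two path functions are toy stand-ins for the "Textual Reasoning" and "Symbolic Reasoning" paths described next, with hard-coded answers.

```python
from collections import Counter
from typing import Callable, List


def mix_self_consistency(
    question: str,
    paths: List[Callable[[str], str]],
    iterations: int = 3,
) -> str:
    """Sketch: run each reasoning path several times, collect the answers,
    and return the most frequent one (the self-consistency vote)."""
    answers = []
    for _ in range(iterations):
        for path in paths:
            answers.append(path(question))
    return Counter(answers).most_common(1)[0][0]


def textual_path(question: str) -> str:
    # Stand-in for the "Textual Reasoning" path described below.
    return "Gene Hackman"


def symbolic_path(question: str) -> str:
    # Stand-in for the "Symbolic Reasoning" path described below.
    return "Gene Hackman"


print(mix_self_consistency(
    "Which nominee won the best actor for the Academy Awards in 1972?",
    [textual_path, symbolic_path],
))
```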
"TEXTUAL REASONING" PATH

This path is straightforward: it embeds the contents of the CSV file directly into the prompt, forming a comprehensive query that is then presented to the LLM. Since we ran this path three times, we got the result list:

```
['Gene Hackman', 'Gene Hackman', 'Gene Hackman']
```

"SYMBOLIC REASONING" PATH

This path uses LlamaIndex's PandasQueryEngine. The query engine loads the CSV file into a pandas DataFrame and then, for the given question, generates pandas instructions whose execution yields the result. For the demo, we got three pandas instructions, one for each of the three iterative runs.

First run:

```
df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee']
```

Second run:

```
df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee'].iloc[0]
```

Third run:

```
df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee']
```

So the result list is:

```
['2 Gene Hackman\nName: Nominee, dtype: object',
 'Gene Hackman',
 '2 Gene Hackman\nName: Nominee, dtype: object']
```

SELF-CONSISTENCY AGGREGATION

This final step counts how often each answer appears in the combined list produced by "Textual Reasoning" and "Symbolic Reasoning", and returns the item with the highest count. In our demo, the answer with the maximum count, and therefore the most likely one, was 'Gene Hackman'.

PROS AND CONS

By merging textual and symbolic reasoning and applying a mixed self-consistency mechanism, the "MixSelfConsistency" query engine boosts accuracy. However, running multiple iterations of these query paths can lead to longer overall response times.

6.2 "CHAINOFTABLE" QUERY ENGINE

The "ChainOfTable" engine first applies a series of operations that transform the original table into a form that directly addresses the query. It then combines the revised table with the question to create a prompt from which the LLM derives the final answer. This last phase resembles the "Textual Reasoning" path of the "MixSelfConsistency" engine.

CHAIN-OF-TABLE OPERATIONS

Let's delve into how the engine constructs its chain of operations. In each iteration, it prompts the LLM to suggest the next operation, considering both the current table state and the history of previous operations. The candidate operations are:

1. f_add_column(): adds a new column, used when the table requires additional inferred data to answer the query accurately.
2. f_select_row(): isolates specific rows when only those rows are relevant to the question.
3. f_select_column(): similarly narrows the table down to the columns needed to answer the query.
4. f_group_by(): for queries about items with identical values and their counts, groups those items, making the data clearer.
5. f_sort_by(): when the query concerns the order or ranking of items in a column, sorts the items to align with the question's context.

DEMO CASE

Returning to our demo, let's revisit the query, "Which nominee won the best actor for the Academy Awards in 1972?". In its first iteration, the "ChainOfTable" engine executes the operation f_select_row(['row 3']).
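In plain pandas terms, this row-selection step corresponds roughly to the sketch below. The iloc call is an illustrative re-expression of f_select_row(['row 3']), not the Llama Pack's internal implementation.

```python
import pandas as pd

# Load the demo table (column names follow the pandas instructions shown earlier).
df = pd.read_csv("WikiTableQuestions/csv/200-csv/11.csv")

# f_select_row(['row 3']) keeps only the third row (index 2), preserving the
# tabular structure so it can be fed back into the next prompt.
selected = df.iloc[[2]]
print(selected.to_string(index=False))
```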
Executing this operation produces a new table, structured as follows:

[Table image: result of f_select_row(['row 3'])]

With only the relevant row left, the final query becomes straightforward and directly yields the answer: "Gene Hackman."

PROS AND CONS

The "ChainOfTable" engine builds a sequence of operations that transforms the original table into a version more closely aligned with the final question. This significantly improves accuracy for queries whose answers are not immediately apparent from the original table and that require a series of table operations to expose. However, each interaction with the LLM must include the current table's contents in the prompt. This can hurt the performance of LLMs on large tables, since the size of the data directly drives the processing load.

6.3 PERFORMANCE COMPARISON

Each execution of a query in the demo reports two key pieces of information: the response from the query engine and the time taken to generate it. In our experiments with the demo CSV file, the "MixSelfConsistency" engine tended to be faster and more accurate than the "ChainOfTable" engine when ChatGPT was selected as the LLM.

It is important to note, however, that these findings do not come from a systematic benchmark or a comprehensive comparison of the two query engines. They are based solely on our limited experiments and should be viewed as preliminary observations rather than definitive conclusions. We encourage readers interested in this area to use our demo as a starting point for more extensive comparisons or benchmark tests.

7. OTHER TAKEAWAYS

INTERACTING WITH LLMS

A critical aspect of the implementation in this demo is establishing the connection to the APIs used for querying the LLMs: the OpenAI API for ChatGPT, and a similar API for local LLMs. Here is the first part of the code:

```python
self.openai_llm = OpenAI(model=OPENAI_API_MODEL)
self.local_llm = OpenAILike(
    api_base=API_BASE,
    api_key=API_KEY,
    model=MODEL_NAME,
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=3900,
    timeout=MAC_M1_LUNADEMO_CONSERVATIVE_TIMEOUT,
)
```

In addition, setting up the ServiceContext for these LLMs is important, especially for local LLMs. For a local LLM, this involves using a local embed_model from Hugging Face (per this document, https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html, the local embed_model refers to "BAAI/bge-large-en"):

```python
if llm_type == GPT_LLM:
    chosen_llm = self.openai_llm
    embed_model = OpenAIEmbedding(embed_batch_size=10)
    service_context = ServiceContext.from_defaults(
        chunk_size=1024, llm=chosen_llm, embed_model=embed_model)
else:
    chosen_llm = self.local_llm
    service_context = ServiceContext.from_defaults(
        chunk_size=1024, llm=chosen_llm, embed_model="local")

set_global_service_context(service_context)
```

This implementation is functional but not ideal in terms of flexibility and ease of use when switching between different LLMs, such as OpenAI and local LLMs. In an ideal setup, classes like `OpenAI` or `OpenAILike` would be able to connect to both OpenAI models and local LLMs simply by specifying the `api_base` and `api_key` of a compatible API. As LlamaIndex explains, the `OpenAILike` class is a thin wrapper around the OpenAI model.
Its purpose is to ensure compatibility with third-party tools that offer an OpenAI-compatible API. However, LlamaIndex currently restricts the use of custom models with its `OpenAI` class, primarily because it needs to infer certain metadata from the model's name. This limitation underscores the need for a more streamlined implementation in the future, one that lets users easily integrate and switch between different LLMs, whether they are OpenAI models or local ones.

OPENAI-COMPATIBLE API SERVICE

The decision to use an OpenAI-compatible API service as an intermediary between the application and the LLMs is strategic, aimed at enhancing modularity and flexibility. It allows LLMs to be swapped without changing application code and mitigates the compatibility issues that can arise from direct integration. With such a setup, applications remain agnostic to the specific LLMs they interact with, which makes updates and maintenance easier.

While selecting an appropriate API service, our initial choice was the GPT4All REST API. GPT4All is a well-known tool for democratizing access to LLMs by enabling their use on standard hardware. However, the GPT4All REST API is not currently compatible with the latest OpenAI API specifications, so it did not let us switch backend LLM services freely. We then evaluated LocalAI, which proved compatible and worked well with the LlamaIndex OpenAILike class. This compatibility was crucial for our requirements, demonstrating LocalAI's adherence to the current specifications and its ability to integrate smoothly with our framework.

MANAGING THE SIZE OF DOCKER IMAGES

We chose to run the demo in Docker containers because of the advantages Docker brings to LLM applications: portability, reproducibility, scalability, security, resource efficiency, and simplified deployment. However, while building the Docker images for the demo, we observed a significant increase in image size after installing PyTorch. To reduce the Docker image size for CPU instances, we installed PyTorch directly from the official wheel index at http://download.pytorch.org/whl/cpu:

```
pip install --no-cache-dir -r requirements.txt \
    --extra-index-url https://download.pytorch.org/whl/cpu
```

This reduced the compressed image size to just 435.06 MB, compared to the considerably larger 5.38 GB compressed size of the GPU image. This strategy is particularly effective for building images tailored to CPU instances, balancing functionality and efficiency.

8. GITHUB LINK

https://github.com/LinkTime-Corp/llm-in-containers/tree/main/tabular-data-analysis