ENHANCING TABULAR DATA ANALYSIS WITH LLMS

Wenxin Song · Published in GoPenAI · 12 min read · Feb 5, 2024





1. INTRODUCTION

In the rapidly evolving landscape of data processing and analysis, Large
Language Models (LLMs) stand at the forefront, offering groundbreaking
capabilities that extend beyond traditional text-based applications. A
particularly intriguing and less explored domain is the use of LLMs in
interpreting and reasoning over tabular data. This blog delves into the
intricacies of leveraging LLMs to query tabular data, a niche yet immensely
potent application that promises to transform how we interact with structured
datasets.

At the heart of our exploration are two innovative technologies: LlamaIndex and
LocalAI. LlamaIndex, embodying the principles outlined in the state-of-the-art
papers “Rethinking Tabular Data Understanding with Large Language Models” and
“Chain-of-Table: Evolving Tables in the Reasoning Chain for Table
Understanding”, serves as a pivotal tool in our quest. It adeptly implements
strategies from these papers, making the theoretical practical. Complementing
LlamaIndex, LocalAI offers a seamless environment to launch and interact with
LLMs locally. This synergy not only democratizes access to advanced data
querying methods but also propels the usability of LLMs to new heights.

This blog aims to demystify and demonstrate the use of LLMs to query tabular
data in natural language. We will take a comprehensive journey, showcasing how
to launch local LLMs using LocalAI's Docker images and how to interface with
them through an OpenAI-compatible API service. The demonstration will not be
purely theoretical: we will set up the entire system in Docker containers,
adaptable to both CPU-only and GPU-equipped machines.


2. THEORETICAL BACKGROUND

The realm of tabular data, with its structured yet complex nature, presents
unique challenges and opportunities for LLMs. In this section, we delve into the
theoretical underpinnings based on two papers, which lay the foundation for our
practical exploration.


“RETHINKING TABULAR DATA UNDERSTANDING WITH LARGE LANGUAGE MODELS”

This paper serves as a cornerstone for understanding the capabilities and
limitations of LLMs in interpreting tabular data. It operates on three core
perspectives:

 1. Robustness to Structural Perturbations: The study reveals that LLMs exhibit
    a notable performance decline when faced with structural variations in
    tables. This insight is crucial as it highlights the need for robust models
    that can maintain accuracy despite changes in table formats.
 2. Textual vs. Symbolic Reasoning: A comparative analysis between textual and
    symbolic reasoning underscores a slight edge for textual reasoning in
    handling tabular data. However, the strengths of each approach vary based on
    specific tasks, suggesting a nuanced application of these reasoning methods.
 3. Boosting Performance through Reasoning Pathways: Perhaps the most
    significant contribution is the proposal to aggregate multiple reasoning
    pathways. By integrating textual and symbolic reasoning and employing a mix
    self-consistency mechanism, the model achieves state-of-the-art performance.
    This approach not only enhances accuracy but also paves the way for more
    sophisticated table-processing paradigms in LLMs.


“CHAIN-OF-TABLE: EVOLVING TABLES IN THE REASONING CHAIN FOR TABLE UNDERSTANDING”

This paper introduces the innovative concept of the Chain-of-Table framework,
revolutionizing how LLMs interact with tabular data:

 1. Tabular Data in the Reasoning Chain: The paper proposes incorporating
    tabular data explicitly into the reasoning chain. This method contrasts with
    traditional approaches that primarily rely on textual context, offering a
    novel way to leverage the structured nature of tables.
 2. Iterative Table Evolution: The framework guides LLMs to iteratively generate
    operations and update the table, effectively creating a ‘tabular reasoning
    chain.’ This dynamic evolution allows the model to plan subsequent
    operations based on previous results, mirroring a more human-like reasoning
    process.
 3. Structured Intermediate Results: An intriguing aspect of this approach is
    how the evolving table carries structured information of intermediate
    results. This not only makes the reasoning process transparent but also
    enhances the reliability and accuracy of predictions.


BRIDGING THEORY AND PRACTICE

These papers collectively form a robust theoretical framework that guides our
practical application of LLMs in querying tabular data. They shed light on the
nuances of tabular data processing, emphasizing the need for models that are
adaptable, capable of complex reasoning, and sensitive to the structured format
of tables. As we move forward in this blog, these theoretical insights will be
the bedrock upon which we build our practical demonstrations using LlamaIndex
and LocalAI.


3. TECHNOLOGY BACKGROUND


LLAMAINDEX AND ITS LLAMA PACKS

LlamaIndex is a versatile ‘data framework’ essential for building LLM
applications. It simplifies the process of ingesting data from a variety of
sources and formats, including APIs, PDFs, documents, and SQL. The framework is
adept at structuring data using indices and graphs, ensuring seamless
compatibility with LLMs. One of its key features is an advanced retrieval and
query interface, which allows users to input LLM prompts and receive
contextually enriched responses.

Llama Packs complements LlamaIndex as a community-driven hub that offers a range
of prepackaged modules to jumpstart LLM application development. These modules
are designed for diverse applications, from creating Streamlit apps to
facilitating advanced retrieval and structured data extraction in resumes. A key
feature of Llama Packs is the flexibility it offers users, allowing them not
only to import modules for immediate use but also to inspect and customize them
to meet specific needs and preferences.
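For a sense of the workflow, downloading a pack for local inspection looks
roughly like the sketch below; the import path of download_llama_pack varies
across LlamaIndex versions, and the two pack names are the table-reasoning
packs used later in this post:

# Minimal sketch of pulling in Llama Packs; import path may differ by LlamaIndex version.
from llama_index.llama_pack import download_llama_pack

# Download the table-reasoning packs into local directories where their code
# can be inspected and customized before use.
download_llama_pack("MixSelfConsistencyPack", "./mix_self_consistency_pack")
download_llama_pack("ChainOfTablePack", "./chain_of_table_pack")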


LOCALAI FRAMEWORK

LocalAI presents itself as a free, Open Source alternative to OpenAI, offering a
unique solution for those seeking local inferencing capabilities. It functions
as a seamless drop-in replacement REST API, fully compatible with OpenAI’s API
specifications. Designed to run Large Language Models (LLMs), generate images,
and produce audio among other functionalities, LocalAI is versatile in its
application. Notably, LocalAI supports multiple model families and
architectures, and it operates effectively without the need for a GPU.
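To illustrate the drop-in compatibility, a request to a locally running LocalAI
server looks exactly like a call to OpenAI's chat completions API. The sketch
below assumes a typical local deployment; the host, port, and model name are
placeholders for whatever your LocalAI instance serves:

# Minimal sketch: querying LocalAI through its OpenAI-compatible REST API.
# The endpoint and model name are assumptions for a typical local deployment.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # LocalAI's OpenAI-compatible endpoint
    json={
        "model": "mistral-7b-instruct",  # whichever model your LocalAI instance hosts
        "messages": [{"role": "user", "content": "Summarize this table in one sentence."}],
        "temperature": 0.1,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])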


4. PRACTICAL DEMONSTRATION


DEMO ENVIRONMENT

For guidance on configuring AWS EC2 CPU or GPU instances to run local LLMs in
Docker containers, refer to this blog.


RUNNING ON AN AWS CPU INSTANCE

For launching the demo on AWS CPU instances, here’s a concise step-by-step guide
to get you started:

1. Clone this repo to your EC2 CPU instance:

git clone https://github.com/LinkTime-Corp/llm-in-containers.git

cd llm-in-containers/tabular-data-analysis

2. Insert your OpenAI API key into conf/config.json for "OPENAI_API_KEY". This
step can be skipped if you don't want to evaluate against the OpenAI backend.

3. Download the local models. If you have a problem with the wget command, you
can manually download the model from this link and save it in the 'models'
directory.

bash download-models.sh

4. Launch the demo:

bash run.sh

5. Visit the UI at http://{IP of EC2 CPU instance}:8501.

6. Shut down the demo:

bash shutdown.sh


OPEN THE WEB UI

Now let’s play around with the UI by uploading some sample data:

 1. Visit the provided link to download a set of sample CSV files. Unzip the
    downloaded file. For the demo, we will use the file located at
    “WikiTableQuestions/csv/200-csv/11.csv”.
 2. Once on the UI, start by uploading a CSV file, such as the one you just
    unzipped. Select the “LLM Type” for processing your query. You can choose
    between “ChatGPT” and “Local_LLM”.
 3. Choose the engine to query your tabular data file. There are two options:
    “MixSelfConsistency” and “ChainOfTable”.
 4. With these selections made, you can now ask questions related to the data in
    your CSV file. For example, “Who won the Best Actor award in 1972?”. Click
    the “Query” button to submit your question and receive the answer from the
    chosen LLM.
 5. On the sidebar of the UI, there’s an option to view the LLM inference trace.
    This feature allows you to see the step-by-step processing of your question
    by the LLM, providing insights into how the answer was derived.


Web UI for querying CSV files


RUNNING ON AN AWS GPU INSTANCE

For launching the demo on AWS GPU instances, here's a concise step-by-step guide
to get you started. It is similar to the CPU instance setup, except that the
"-gpu" flag is passed to the 'run.sh' and 'shutdown.sh' scripts:

1. Clone this repo to your EC2 GPU instance:

git clone https://github.com/LinkTime-Corp/llm-in-containers.git

cd llm-in-containers/tabular-data-analysis

2. Insert your OpenAI API key into conf/config.json for "OPENAI_API_KEY".

3. Launch the demo:

bash run.sh -gpu

4. Visit the UI at http://{IP of EC2 GPU instance}:8501.

5. Shut down the demo:

bash shutdown.sh -gpu


5. CODE STRUCTURE

The code structure of the demo is designed to be simple and intuitive, with
different components organized into separate files for clarity and ease of
maintenance:

 * main.py: This file contains the code for the User Interface (UI).
 * backend.py: It’s responsible for handling the logic of choosing LLMs & Query
   Engines and interacting with LLMs.
 * constants.py: All the constants used throughout the codebase are defined
   here.
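To make the division of labor concrete, the UI-to-backend flow is roughly as
sketched below; the Backend class and its query() signature are illustrative
rather than the repo's exact API:

# Hypothetical sketch of how main.py (the UI) might hand a query to backend.py.
# The class and parameter names are illustrative; see the repo for the real interfaces.
from backend import Backend

backend = Backend()
answer = backend.query(
    csv_path="WikiTableQuestions/csv/200-csv/11.csv",
    question="Which nominee won the best actor for the Academy Awards in 1972?",
    llm_type="ChatGPT",                # or "Local_LLM", as offered in the UI
    engine_type="MixSelfConsistency",  # or "ChainOfTable"
)
print(answer)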


6. UNRAVELING THE MECHANICS OF QUERY ENGINES

In the demonstration, we used a CSV file named "11.csv" from the
"WikiTableQuestions/csv/200-csv" sample dataset. This CSV file contains the
structured tabular data shown in the table below. Let's explore how the query
engines respond to the question, "Which nominee won the best actor for the
Academy Awards in 1972?".


The CSV file used in the demo


6.1 “MIXSELFCONSISTENCY” QUERY ENGINE

The “MixSelfConsistency” engine operates by cycling through two distinct types
of query paths, which are configurable in terms of iterations. These two paths
are called “Textual Reasoning” and “Symbolic Reasoning”.
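Assuming the corresponding MixSelfConsistencyPack has been downloaded as in
Section 3, constructing this engine looks roughly like the sketch below; the
engine class name comes from that pack, while the exact constructor arguments
(path counts, verbosity) are assumptions based on its documentation:

# Sketch: building the "MixSelfConsistency" engine from the downloaded Llama Pack.
# Constructor arguments may differ across pack versions.
import pandas as pd
from llama_index.llms import OpenAI  # import path may vary by LlamaIndex version
from mix_self_consistency_pack.base import MixSelfConsistencyQueryEngine

df = pd.read_csv("WikiTableQuestions/csv/200-csv/11.csv")
llm = OpenAI(model="gpt-3.5-turbo")  # or the OpenAILike instance from Section 7
engine = MixSelfConsistencyQueryEngine(
    df=df,
    llm=llm,
    text_paths=3,      # number of "Textual Reasoning" runs
    symbolic_paths=3,  # number of "Symbolic Reasoning" runs
    verbose=True,
)
response = engine.query(
    "Which nominee won the best actor for the Academy Awards in 1972?"
)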


“TEXTUAL REASONING” PATH

This path is straightforward. It operates by integrating the contents of a CSV
file directly into the prompt, thereby forming a comprehensive query that is
then presented to the LLM. Since we ran this path three times, we got the result
list:

['Gene Hackman', 'Gene Hackman', 'Gene Hackman'].
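Conceptually, the prompt for this path is built by serializing the table and
appending the question, along the lines of the simplified sketch below (the
pack's actual prompt template is more elaborate):

# Simplified sketch of the textual-reasoning prompt; the real template is richer.
import pandas as pd

df = pd.read_csv("WikiTableQuestions/csv/200-csv/11.csv")
question = "Which nominee won the best actor for the Academy Awards in 1972?"

prompt = (
    "You are given the following table:\n"
    f"{df.to_csv(index=False)}\n"
    f"Answer the question based only on the table: {question}"
)
# The prompt is sent to the LLM once per run; each run contributes one answer
# to the result list above.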


“SYMBOLIC REASONING” PATH

This path uses LlamaIndex's PandasQueryEngine. This query engine loads the CSV
file into a pandas DataFrame and then generates pandas instructions for the
given question to produce the results. For the demo, we got three pandas
instructions, each corresponding to one of the three iterative runs.

First Run:

df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee']

Second Run:

df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee'].iloc[0]

Third Run:

df[(df['Award'] == 'Academy Awards, 1972') & (df['Category'] == 'Best Actor') & (df['Result'] == 'Won')]['Nominee']

So the result list is:

[
    '2    Gene Hackman\nName: Nominee, dtype: object', 
    'Gene Hackman', 
    '2    Gene Hackman\nName: Nominee, dtype: object'
]
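For reference, wiring up this symbolic path with LlamaIndex's PandasQueryEngine
looks roughly like the sketch below; the import path differs between LlamaIndex
versions, so treat it as an illustration:

# Sketch of the symbolic path via PandasQueryEngine (import path varies by version).
import pandas as pd
from llama_index.query_engine import PandasQueryEngine

df = pd.read_csv("WikiTableQuestions/csv/200-csv/11.csv")
query_engine = PandasQueryEngine(df=df, verbose=True)

# The engine asks the LLM for a pandas expression (like the three runs above),
# executes it against df, and returns the evaluated output as the response.
response = query_engine.query(
    "Which nominee won the best actor for the Academy Awards in 1972?"
)
print(response)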


SELF CONSISTENCY AGGREGATION

This final process aggregates the counts of items that appear in the combined
list generated from both “Textual Reasoning” and “Symbolic Reasoning”. It then
returns the item with the highest count. In our demo, the item that emerged with
the maximum count, indicating it as the most likely answer, was ‘Gene Hackman’.
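In effect, this is a majority vote over the pooled answers from both paths,
after normalizing the raw pandas output down to a bare value. A simplified
sketch (the engine's actual normalization is more careful than this):

# Simplified sketch of self-consistency aggregation: majority vote over all answers.
from collections import Counter

textual = ['Gene Hackman', 'Gene Hackman', 'Gene Hackman']
symbolic = [
    '2    Gene Hackman\nName: Nominee, dtype: object',
    'Gene Hackman',
    '2    Gene Hackman\nName: Nominee, dtype: object',
]

def normalize(answer):
    # Strip pandas Series formatting (index, dtype) down to the bare value.
    first_line = answer.splitlines()[0]
    return first_line.split("    ")[-1].strip()

votes = Counter(normalize(a) for a in textual + symbolic)
final_answer, count = votes.most_common(1)[0]
print(final_answer)  # -> 'Gene Hackman'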


PROS AND CONS

The “MixSelfConsistency” query engine, by merging textual and symbolic reasoning
and utilizing a mixed self-consistency mechanism, boosts accuracy. However,
conducting multiple iterations of these query paths may lead to longer overall
response times.


6.2 “CHAINOFTABLE” QUERY ENGINE

The “ChainOfTable” engine initially employs a series of operations to modify the
original table into a format that directly addresses the query. Subsequently, it
combines this revised table and the question to create a prompt, enabling the
LLM to derive the final answer. This last phase resembles the “Textual
Reasoning” path of the “MixSelfConsistency” engine.


CHAIN OF TABLE OPERATIONS

Let's delve into its method of constructing a chain of operations; a small
sketch of applying such a chain follows the list below. In each iteration, the
engine prompts the LLM to suggest the next operation, considering both the
current table state and the history of previous operations. The potential
operations include:

 1. f_add_column(): This function is used to add a new column, especially when
    the table requires additional inferred data to accurately respond to the
    query.
 2. f_select_row(): When only specific rows are relevant to the question, this
    operation is employed to isolate and focus on these rows.
 3. f_select_column(): Similar to row selection, this function narrows down the
    table’s focus to certain columns deemed necessary for answering the query.
 4. f_group_by(): For queries involving items with identical values and their
    counts, this operation groups such items, enhancing clarity in the data
    presentation.
 5. f_sort_by(): If the query pertains to the order or ranking of items within a
    column, this function sorts the items accordingly to align with the
    question’s context.


DEMO CASE

Returning to our demo, let's revisit the query, “Which nominee won the best
actor for the Academy Awards in 1972?”. In response, the “ChainOfTable” engine,
during its first iteration, executes the operation f_select_row(['row 3']). This
action results in the creation of a new table, structured as follows:


Result for f_select_row(['row 3'])

The final query becomes straightforward, directly yielding the final answer:
“Gene Hackman.”


PROS AND CONS

The “ChainOfTable” engine creates a sequence of operations that transform the
original table into a version more aligned with the final question. This method
significantly enhances accuracy for queries whose answers are not immediately
apparent from the original table, necessitating a series of table operations for
clarification. However, this process requires each interaction with the LLM to
incorporate the current table’s content within the prompt. This approach can
impact the performance of LLMs, particularly when dealing with large tables, as
the size of the data directly influences the processing load.


6.3 PERFORMANCE COMPARISON

In the demonstration of our query engine, each execution of a query provides two
key pieces of information: the response from the query engine and the time taken
to generate that response. From our experiments using the CSV file in the demo,
we observed that the “MixSelfConsistency” engine tends to be faster and more
accurate compared to the “ChainOfTable” engine when ChatGPT is selected as the
LLM.

However, it’s important to note that these findings are not derived from a
systematic benchmark test or a comprehensive comparison of the two query
engines. The results we mention are based solely on our limited experiments. As
such, they should be viewed as preliminary observations rather than definitive
conclusions.

We encourage individuals who have an interest in this field to use our demo as a
starting point for more extensive comparisons or benchmark tests.


7. OTHER TAKEAWAYS


INTERACTING WITH LLMS

A critical aspect of the implementation in this demo is establishing the
connection to the APIs used for querying the LLMs. This includes setting up the
connection to the OpenAI API for ChatGPT and a similar API for local LLMs. Here
is the first part of the code:

# Import paths may vary across LlamaIndex versions.
from llama_index.llms import OpenAI, OpenAILike
from constants import (OPENAI_API_MODEL, API_BASE, API_KEY, MODEL_NAME,
                       MAC_M1_LUNADEMO_CONSERVATIVE_TIMEOUT)

# OpenAI backend (ChatGPT)
self.openai_llm = OpenAI(model=OPENAI_API_MODEL)
# Local backend, served through LocalAI's OpenAI-compatible API
self.local_llm = OpenAILike(
    api_base=API_BASE,
    api_key=API_KEY,
    model=MODEL_NAME,
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=3900,
    timeout=MAC_M1_LUNADEMO_CONSERVATIVE_TIMEOUT,
)

In addition to the above code, setting up the ServiceContext for these LLMs is
also important, especially for local LLMs. For a local LLM, this involves using
a local embed_model from Hugging Face (per this document:
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html,
the local embed_model refers to "BAAI/bge-large-en"):

# Import paths may vary across LlamaIndex versions.
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import OpenAIEmbedding

if llm_type == GPT_LLM:
    # ChatGPT backend: use OpenAI embeddings.
    chosen_llm = self.openai_llm
    embed_model = OpenAIEmbedding(embed_batch_size=10)
    service_context = ServiceContext.from_defaults(
        chunk_size=1024, llm=chosen_llm, embed_model=embed_model)
else:
    # Local backend: "local" selects the Hugging Face embed_model ("BAAI/bge-large-en").
    chosen_llm = self.local_llm
    service_context = ServiceContext.from_defaults(
        chunk_size=1024, llm=chosen_llm, embed_model="local")
    set_global_service_context(service_context)

The above implementation is functional but is not considered ideal in terms of
flexibility and ease of use when switching between different LLMs, such as
OpenAI and local LLMs. In an ideal setup, classes like `OpenAI` or `OpenAILike`
should be capable of establishing connections with both OpenAI models and local
LLMs. This would be facilitated by simply specifying the `api_base` and
`api_key` for these compatible APIs.

As explained by LlamaIndex, the `OpenAILike` class is a thin wrapper around the
OpenAI model. Its purpose is to ensure compatibility with third-party tools that
offer an OpenAI-compatible API. However, a limitation arises as LlamaIndex
currently restricts the use of custom models with their `OpenAI` class,
primarily due to the need to infer certain metadata from the model’s name.

This limitation underscores the need for an optimized implementation in the
future. Such an enhancement would enable users to easily integrate and switch
between different LLMs, whether they are OpenAI or local LLMs.


OPENAI COMPATIBLE API SERVICE

The decision to utilize an OpenAI-compatible API service as an intermediary
between the application and LLMs is strategic, aimed at enhancing modularity
and flexibility. This approach allows LLMs to be swapped seamlessly without
changing application code, mitigating compatibility issues that may arise from
direct integration. Such a setup ensures that applications remain agnostic to
the specific LLMs they interact with, facilitating easier updates and
maintenance.
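The practical payoff is that the application only knows about an
OpenAI-compatible endpoint; which LLM actually serves the requests is decided
by configuration rather than code. A minimal sketch, in which the environment
variable names and default values are placeholders:

# Sketch: the backend LLM is selected purely by configuration.
import os
from llama_index.llms import OpenAILike  # import path may vary by LlamaIndex version

llm = OpenAILike(
    api_base=os.environ.get("LLM_API_BASE", "http://localhost:8080/v1"),  # e.g. a LocalAI instance
    api_key=os.environ.get("LLM_API_KEY", "not-needed-for-localai"),
    model=os.environ.get("LLM_MODEL", "mistral-7b-instruct"),  # swap models by changing this value
    is_chat_model=True,
)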

During the process of selecting an appropriate API service, our initial choice
was the GPT4All REST API. GPT4All is a well-known tool for democratizing access
to LLMs by enabling their use on standard hardware. However, the GPT4All REST
API is not currently compatible with the latest OpenAI API specifications, so
we could not use it to swap the backend LLM service. Subsequently, we evaluated
LocalAI, which proved to be compatible and functioned effectively with the
LlamaIndex OpenAILike class. This compatibility was crucial for our
requirements, demonstrating LocalAI's adherence to the current specifications
and its ability to integrate smoothly with our framework.


MANAGING THE SIZE OF DOCKER IMAGES

Choosing to run our demo within Docker containers was driven by the numerous
advantages Docker provides for LLM applications, such as enhanced portability,
reproducibility, scalability, security, resource efficiency, and simplified
deployment processes. However, during the process of constructing Docker images
for our demonstration, we observed a significant increase in the image size
following the installation of PyTorch. To address this and reduce the Docker
image size for CPU instances, we opted to install PyTorch directly from the
official CPU-only wheels available at https://download.pytorch.org/whl/cpu:

pip install --no-cache-dir -r requirements.txt \
    --extra-index-url https://download.pytorch.org/whl/cpu

This approach significantly reduced the compressed image size to just 435.06 MB,
compared to the considerably larger compressed size of 5.38 GB for GPU
instances. Adopting this strategy is particularly effective for those looking to
specifically build images tailored for CPU instances, offering a balance between
functionality and efficiency.


8. GITHUB LINK:

https://github.com/LinkTime-Corp/llm-in-containers/tree/main/tabular-data-analysis



