DATABYTEGO - THE DATA BLOG


[AI/LLM SERIES] BUILDING A SMARTER DATA PIPELINE FOR LLM & RAG: THE KEY TO
ENHANCED ACCURACY AND PERFORMANCE


WANT TO GET THE MOST OUT OF YOUR LLM AND RAG SYSTEMS? IT ALL STARTS WITH A SOLID
DATA PIPELINE. IN THIS GUIDE, WE'LL BREAK DOWN HOW SIMPLE TECHNIQUES LIKE
CHUNKING AND CLEANING YOUR DATA CAN BOOST ACCURACY AND PERFORMANCE.

DataByteGo
Sep 17, 2024



KEY TAKEAWAYS


 * Breaking data into chunks helps RAG systems work better. By splitting large
   datasets into smaller, manageable pieces, RAG systems can process information
   more accurately and return more relevant results.

 * Clean data is key to getting good results. RAG systems rely on clean,
   well-organized data to deliver reliable answers. Messy or inconsistent data
   can lead to mistakes or irrelevant information.

 * Poorly chunked or unclean data hurts performance. If data isn’t properly
   chunked or cleaned, RAG systems might return incomplete, confusing, or
   inaccurate results, which can frustrate users.

 * A well-structured data pipeline makes all the difference. Having a smooth
   process for organizing and cleaning data ensures RAG systems run efficiently,
   leading to better insights and overall user experience.





Today, every organization is turning to artificial intelligence (AI) not only
for its products and offerings but also to boost productivity and streamline
operations. According to IBM, AI is making significant strides in customer
service, talent management, and modernizing applications. For instance, AI can
handle up to 70% of contact center cases, improving the customer experience by
providing faster and more accurate responses. In HR, AI-powered solutions are
boosting productivity by 40%, making tasks like candidate screening and employee
training more efficient. Additionally, AI is enhancing application modernization
by 30%, reducing the workload in IT operations through automation, such as
handling support tickets and managing incidents.

Source: https://www.ibm.com/think/topics/generative-ai-for-knowledge-management




HOW ARE ORGANIZATIONS MAKING IT HAPPEN?


One of the most impactful applications of AI is Retrieval-Augmented Generation
(RAG), which combines the power of Large Language Models (LLMs) with real-time
data retrieval. This combination is reshaping how businesses operate, helping
them respond more quickly, make better decisions, and deliver improved customer
experiences.

Tip: Learn what RAG is and how it works:
https://www.databytego.com/p/aiml-brief-introduction-to-retrieval
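
To make the retrieve-then-generate flow concrete, here is a minimal sketch of a
RAG query loop. The vector_store.search and llm.generate calls are hypothetical
placeholders for whatever retrieval index and model client you use, not a
specific library's API.

# Minimal sketch of a RAG query: retrieve relevant chunks, build a grounded
# prompt, and ask the LLM. `vector_store` and `llm` are hypothetical stand-ins.

def answer_with_rag(question, vector_store, llm, top_k=5):
    # 1. Retrieve: look up the chunks most similar to the question.
    chunks = vector_store.search(question, top_k=top_k)

    # 2. Augment: place the retrieved text into the prompt as context.
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the model answers, grounded in the retrieved data.
    return llm.generate(prompt)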

Some of the most common examples and use cases of RAG (and LLMs) in
organizations are:

 * Customer Support: Automates responses by retrieving answers from knowledge
   bases for faster customer service.

 * Sales and Marketing: Generates personalized marketing content using customer
   data and market trends, and identifies the most qualified customers based on
   their behavior.

 * Legal and Compliance: Analyzes contracts by retrieving key clauses and
   providing summaries or risk assessments.

 * Research and Development: Retrieves and synthesizes information from patents
   and scientific papers to guide innovation.

 * Financial Services: Produces detailed risk assessments by analyzing financial
   reports and market data.




WHAT DOES A TYPICAL RAG PIPELINE LOOK LIKE?


THIS IS WHAT A TYPICAL RAG PIPELINE LOOKS LIKE IN AN ENTERPRISE. EACH DOTTED
BOX REPRESENTS EITHER A USE CASE OR A PROJECT IN THE ORGANIZATION.



COMPANY → BU/DEPARTMENTS → USE CASES → PROJECTS → PIPELINES




 1. Connect Datasource:
    Establishes connections with various data sources like databases, file
    storage, and internal apps. This enables pulling relevant data to feed the
    RAG model.

 2. Route:
    Directs the data from the sources to the appropriate downstream processes.
    This ensures the right data reaches the correct models or systems for an
    optimized flow.

 3. Transform:
    Cleans, standardizes, and prepares data for use in machine learning
    workflows. This helps ensure consistent, clean data for accurate model
    results.

 4. Chunk:
    Breaks large datasets into manageable pieces, suitable for LLM processing.
    This allows the model to process data without exceeding context
    limitations.

 5. Embed:
    Converts text chunks into vector embeddings for semantic search and data
    retrieval. Embeddings make data searchable based on meaning, improving
    query results.

 6. Persist:
    Stores processed data and embeddings for future queries in a vector
    database or data lake. This ensures fast and efficient retrieval when
    needed for RAG workflows, as sketched in the example below.
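
A compact sketch of how these stages might hang together in code is shown
below. Everything external (the document iterator, the embedding model, the
vector store) is a stand-in to be replaced with your own connectors; the chunk
size and overlap are illustrative defaults, not recommendations.

# Illustrative sketch of the Transform -> Chunk -> Embed -> Persist stages.
# Connect/Route are assumed to happen upstream and hand us (doc_id, raw_text)
# pairs; `embed_model` and `vector_store` are hypothetical clients.

import re

def transform(text):
    # Transform: strip leftover markup and normalize whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text, size=800, overlap=100):
    # Chunk: fixed-size windows with overlap so sentences aren't cut off
    # from their surrounding context.
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]

def run_pipeline(documents, embed_model, vector_store):
    for doc_id, raw_text in documents:            # Connect/Route output
        clean = transform(raw_text)               # Transform
        for i, piece in enumerate(chunk(clean)):  # Chunk
            vector = embed_model.embed(piece)     # Embed
            vector_store.upsert(                  # Persist
                id=f"{doc_id}-{i}",
                vector=vector,
                metadata={"text": piece},
            )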








THE PROBLEM OF CREATING A BAD DATA PIPELINE FOR RAG


On the other hand, a poorly constructed data pipeline can severely undermine the
performance of RAG systems. Without a reliable pipeline, data ingestion can be
inconsistent, leading to gaps or outdated information. If the data is not
properly cleaned or preprocessed, it can introduce noise, biases, or errors into
the system, resulting in inaccurate or irrelevant outputs. This could be
particularly damaging in enterprise applications where precision and reliability
are critical, such as in legal document analysis, financial forecasting, or
medical research.

A poorly designed data pipeline can cause problems like inconsistent data flow,
wrong information, or slow responses. This makes the RAG (Retrieval-Augmented
Generation) system less useful. If the pipeline doesn't clean or organize the
data well, the system may provide confusing or incorrect answers, frustrating
users and requiring more manual fixes.

EXAMPLE: CUSTOMER SUPPORT


Imagine a company using RAG to answer customer service questions. If the
pipeline doesn't clean up the data—removing things like special characters or
outdated information—the system might provide messy or irrelevant responses. For
example, if a customer asks, "How do I update my shipping address?" but the
system pulls outdated instructions or includes strange characters in its
response, it could confuse the customer and lead to frustration.

This kind of bad pipeline can also make the system slower, causing delays in
answering questions, which could reduce the trust customers have in the
company’s support service.
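
As a hedged illustration of those two hygiene steps, the sketch below drops
stale knowledge-base articles and strips stray characters before indexing. The
article fields (updated_at, body) and the one-year freshness cutoff are
assumptions made for the example, not the schema or policy of any particular
support platform.

# Sketch: keep only recent knowledge-base articles and scrub noisy characters
# before they are chunked and embedded. Field names are illustrative.

import re
from datetime import datetime, timedelta

def is_current(article, max_age_days=365):
    # Drop articles that have not been updated within the cutoff window.
    updated = datetime.fromisoformat(article["updated_at"])
    return datetime.now() - updated <= timedelta(days=max_age_days)

def scrub(text):
    # Remove control characters and collapse repeated whitespace that would
    # otherwise leak into generated answers.
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def prepare_articles(articles):
    return [
        {**a, "body": scrub(a["body"])}
        for a in articles
        if is_current(a)
    ]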

EXAMPLE: RESEARCH AND DEVELOPMENT


In a research and development team, a RAG system might be used to summarize
scientific papers and patents. If the data pipeline is poorly designed, it could
fail to properly segment the text into meaningful chunks or remove irrelevant
formatting. For instance, if the system tries to summarize a paper but the data
is jumbled with incorrect citations or non-standard characters, the summaries
might be misleading or incomplete. This can hinder the research process, slow
down innovation, and lead to incorrect conclusions.

Learn more about RAG use cases here:
https://hyperight.com/7-practical-applications-of-rag-models-and-their-impact-on-society/




WHY DATA CHUNKING AND DATA CLEANING ARE CRUCIAL FOR RAG SYSTEMS


When using Retrieval-Augmented Generation (RAG) technology, having high-quality
data is essential for getting accurate and helpful results. Two key processes
that help ensure this quality are data chunking and data cleaning.

DATA CHUNKING: MAKING INFORMATION MANAGEABLE


Recommended reading: Breaking up is hard to do: Chunking in RAG applications

https://stackoverflow.blog/2024/06/06/breaking-up-is-hard-to-do-chunking-in-rag-applications/

Data chunking means breaking down large amounts of information into smaller,
easier-to-handle pieces. This helps the RAG system process data more
effectively.

Imagine you’re using a customer support system to find information in a long
product manual. If the manual is divided into clear sections—like setup
instructions, troubleshooting, and warranty info—the system can quickly find and
provide the right details when a customer asks for help. Proper chunking helps
the system avoid getting overwhelmed and ensures it gives precise answers.
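
To make the idea of section-first chunking concrete, here is a minimal sketch
that splits a manual on its headings and then caps each section at a size the
model can comfortably handle. The markdown-style heading pattern and the
1,000-character cap are assumptions for illustration, not fixed rules.

# Sketch: chunk a manual by its sections (setup, troubleshooting, warranty...)
# so each retrieved piece stays on one topic. The heading regex is an assumption.

import re

def chunk_manual(manual_text, max_chars=1000):
    chunks = []
    # Split in front of markdown-style headings such as "## Troubleshooting".
    sections = re.split(r"\n(?=#{1,3}\s)", manual_text)
    for section in sections:
        body = section.strip()
        if not body:
            continue
        title = body.splitlines()[0]
        # Split oversized sections further so they fit the model's context.
        for start in range(0, len(body), max_chars):
            chunks.append({"section": title, "text": body[start:start + max_chars]})
    return chunks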

DATA CLEANING: KEEPING INFORMATION ACCURATE


Data cleaning involves fixing mistakes and removing unnecessary or confusing
information from the data. This is important because clean, accurate data helps
the RAG system give reliable responses.

For example, if a legal compliance system is analyzing contracts, and the data
includes strange symbols or outdated sections, the results might be incorrect or
confusing. Cleaning the data—by removing errors, fixing formatting issues, and
standardizing the text—makes sure the system delivers clear and correct
information.
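
A hedged sketch of those cleanup steps (standardize the text, drop stray
symbols, repair formatting) might look like the following; the regular
expressions are illustrative choices, not a canonical recipe for legal
documents.

# Sketch: normalize and clean extracted contract text before chunking and
# embedding. The allowed-character set is an illustrative assumption.

import re
import unicodedata

def clean_contract_text(raw):
    # Standardize unicode forms (smart quotes, ligatures) to a single shape.
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words hyphenated across line breaks by PDF extraction.
    text = re.sub(r"-\n(\w)", r"\1", text)
    # Drop non-printable and decorative symbols that confuse retrieval.
    text = re.sub(r"[^\w\s.,;:()%/&$-]", " ", text)
    # Collapse repeated whitespace left over from the removals above.
    text = re.sub(r"\s+", " ", text)
    return text.strip()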





--------------------------------------------------------------------------------

Thank you for reading through this blog. I hope you have learned something new
and interesting.

Given the depth and detail this topic deserves, I will be publishing Part 2,
covering:

 * Why data chunking and data normalization matter

 * Types of data chunking and data normalization techniques

 * Detailed examples





