DATABYTEGO - THE DATA BLOG


[AI/LLM SERIES] BUILDING A SMARTER DATA PIPELINE FOR LLM & RAG: THE KEY TO
ENHANCED ACCURACY AND PERFORMANCE


WANT TO GET THE MOST OUT OF YOUR LLM AND RAG SYSTEMS? IT ALL STARTS WITH A SOLID
DATA PIPELINE. IN THIS GUIDE, WE'LL BREAK DOWN HOW SIMPLE TECHNIQUES LIKE
CHUNKING AND CLEANING YOUR DATA CAN BOOST ACCURACY AND PERFORMANCE.

DataByteGo
Sep 17, 2024



KEY TAKEAWAYS


 * Breaking data into chunks helps RAG systems work better. By splitting large
   datasets into smaller, manageable pieces, RAG systems can process information
   more accurately and return more relevant results.

 * Clean data is key to getting good results. RAG systems rely on clean,
   well-organized data to deliver reliable answers. Messy or inconsistent data
   can lead to mistakes or irrelevant information.

 * Poorly chunked or unclean data hurts performance. If data isn’t properly
   chunked or cleaned, RAG systems might return incomplete, confusing, or
   inaccurate results, which can frustrate users.

 * A well-structured data pipeline makes all the difference. Having a smooth
   process for organizing and cleaning data ensures RAG systems run efficiently,
   leading to better insights and overall user experience.





Today, every organization is turning to artificial intelligence (AI) not only
for its products and offerings but also to boost productivity and streamline
operations. According to IBM, AI is making significant strides in customer
service, talent management, and modernizing applications. For instance, AI can
handle up to 70% of contact center cases, improving the customer experience by
providing faster and more accurate responses. In HR, AI-powered solutions are
boosting productivity by 40%, making tasks like candidate screening and employee
training more efficient. Additionally, AI is enhancing application modernization
by 30%, reducing the workload in IT operations through automation, such as
handling support tickets and managing incidents.

Source: https://www.ibm.com/think/topics/generative-ai-for-knowledge-management




HOW ARE ORGANIZATIONS MAKING IT HAPPEN?


One of the most impactful applications of AI is Retrieval-Augmented Generation
(RAG), which combines the power of Large Language Models (LLMs) with real-time
data retrieval. This combination is reshaping how businesses operate, helping
them respond more quickly, make better decisions, and deliver improved customer
experiences.

Tip: Learn what RAG is and how it works:
https://www.databytego.com/p/aiml-brief-introduction-to-retrieval
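
To make the retrieve-then-generate flow concrete, here is a minimal sketch of a
RAG query loop. The vector_store.search and llm.generate calls are hypothetical
placeholders for whatever retrieval index and model client you use, not a
specific library's API.

# Minimal sketch of a RAG query: retrieve relevant chunks, build a grounded
# prompt, and ask the LLM. `vector_store` and `llm` are hypothetical stand-ins.

def answer_with_rag(question, vector_store, llm, top_k=5):
    # 1. Retrieve: look up the chunks most similar to the question.
    chunks = vector_store.search(question, top_k=top_k)

    # 2. Augment: place the retrieved text into the prompt as context.
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the model answers, grounded in the retrieved data.
    return llm.generate(prompt)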

Some of the most common examples and use cases of RAG (and LLMs) in
organizations are:

 * Customer Support: Automates responses by retrieving answers from knowledge
   bases for faster customer service.

 * Sales and Marketing: Generates personalized marketing content using customer
   data and market trends, and identifies the most qualified customers based on
   their behavior.

 * Legal and Compliance: Analyzes contracts by retrieving key clauses and
   providing summaries or risk assessments.

 * Research and Development: Retrieves and synthesizes information from patents
   and scientific papers to guide innovation.

 * Financial Services: Produces detailed risk assessments by analyzing financial
   reports and market data.




WHAT DOES A TYPICAL RAG PIPELINE LOOK LIKE?


THIS IS WHAT A TYPICAL RAG PIPELINE LOOKS LIKE IN AN ENTERPRISE. EACH DOTTED
BOX REPRESENTS EITHER A USE CASE OR A PROJECT IN THE ORGANIZATION.



COMPANY → BU/DEPARTMENTS → USE CASES → PROJECTS → PIPELINES




 1. Connect Datasource:
    Establishes connections with various data sources like databases, file
    storage, and internal apps. This enables pulling relevant data to feed the
    RAG model.

 2. Route:
    Directs the data from the sources to the appropriate downstream processes.
    This ensures the right data reaches the correct models or systems for an
    optimized flow.

 3. Transform:
    Cleans, standardizes, and prepares data for use in machine learning
    workflows. This helps ensure consistent, clean data for accurate model
    results.

 4. Chunk:
    Breaks large datasets into manageable pieces, suitable for LLM processing.
    This allows the model to process data without exceeding context
    limitations.

 5. Embed:
    Converts text chunks into vector embeddings for semantic search and data
    retrieval. Embeddings make data searchable based on meaning, improving
    query results.

 6. Persist:
    Stores processed data and embeddings for future queries in a vector
    database or data lake. This ensures fast and efficient retrieval when
    needed for RAG workflows, as sketched in the example below.
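
A compact sketch of how these stages might hang together in code is shown
below. Everything external (the document iterator, the embedding model, the
vector store) is a stand-in to be replaced with your own connectors; the chunk
size and overlap are illustrative defaults, not recommendations.

# Illustrative sketch of the Transform -> Chunk -> Embed -> Persist stages.
# Connect/Route are assumed to happen upstream and hand us (doc_id, raw_text)
# pairs; `embed_model` and `vector_store` are hypothetical clients.

import re

def transform(text):
    # Transform: strip leftover markup and normalize whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text, size=800, overlap=100):
    # Chunk: fixed-size windows with overlap so sentences aren't cut off
    # from their surrounding context.
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]

def run_pipeline(documents, embed_model, vector_store):
    for doc_id, raw_text in documents:            # Connect/Route output
        clean = transform(raw_text)               # Transform
        for i, piece in enumerate(chunk(clean)):  # Chunk
            vector = embed_model.embed(piece)     # Embed
            vector_store.upsert(                  # Persist
                id=f"{doc_id}-{i}",
                vector=vector,
                metadata={"text": piece},
            )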








THE PROBLEM OF CREATING A BAD DATA PIPELINE FOR RAG


On the other hand, a poorly constructed data pipeline can severely undermine the
performance of RAG systems. Without a reliable pipeline, data ingestion can be
inconsistent, leading to gaps or outdated information. If the data is not
properly cleaned or preprocessed, it can introduce noise, biases, or errors into
the system, resulting in inaccurate or irrelevant outputs. This could be
particularly damaging in enterprise applications where precision and reliability
are critical, such as in legal document analysis, financial forecasting, or
medical research.

A poorly designed data pipeline can cause problems like inconsistent data flow,
wrong information, or slow responses. This makes the RAG (Retrieval-Augmented
Generation) system less useful. If the pipeline doesn't clean or organize the
data well, the system may provide confusing or incorrect answers, frustrating
users and requiring more manual fixes.

EXAMPLE: CUSTOMER SUPPORT


Imagine a company using RAG to answer customer service questions. If the
pipeline doesn't clean up the data—removing things like special characters or
outdated information—the system might provide messy or irrelevant responses. For
example, if a customer asks, "How do I update my shipping address?" but the
system pulls outdated instructions or includes strange characters in its
response, it could confuse the customer and lead to frustration.

This kind of bad pipeline can also make the system slower, causing delays in
answering questions, which could reduce the trust customers have in the
company’s support service.
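
As a hedged illustration of those two hygiene steps, the sketch below drops
stale knowledge-base articles and strips stray characters before indexing. The
article fields (updated_at, body) and the one-year freshness cutoff are
assumptions made for the example, not the schema or policy of any particular
support platform.

# Sketch: keep only recent knowledge-base articles and scrub noisy characters
# before they are chunked and embedded. Field names are illustrative.

import re
from datetime import datetime, timedelta

def is_current(article, max_age_days=365):
    # Drop articles that have not been updated within the cutoff window.
    updated = datetime.fromisoformat(article["updated_at"])
    return datetime.now() - updated <= timedelta(days=max_age_days)

def scrub(text):
    # Remove control characters and collapse repeated whitespace that would
    # otherwise leak into generated answers.
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def prepare_articles(articles):
    return [
        {**a, "body": scrub(a["body"])}
        for a in articles
        if is_current(a)
    ]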

EXAMPLE: RESEARCH AND DEVELOPMENT


In a research and development team, a RAG system might be used to summarize
scientific papers and patents. If the data pipeline is poorly designed, it could
fail to properly segment the text into meaningful chunks or remove irrelevant
formatting. For instance, if the system tries to summarize a paper but the data
is jumbled with incorrect citations or non-standard characters, the summaries
might be misleading or incomplete. This can hinder the research process, slow
down innovation, and lead to incorrect conclusions.

Learn more about RAG use cases here:
https://hyperight.com/7-practical-applications-of-rag-models-and-their-impact-on-society/




WHY DATA CHUNKING AND DATA CLEANING ARE CRUCIAL FOR RAG SYSTEMS


When using Retrieval-Augmented Generation (RAG) technology, having high-quality
data is essential for getting accurate and helpful results. Two key processes
that help ensure this quality are data chunking and data cleaning.

DATA CHUNKING: MAKING INFORMATION MANAGEABLE


Recommended reading: Breaking up is hard to do: Chunking in RAG applications

https://stackoverflow.blog/2024/06/06/breaking-up-is-hard-to-do-chunking-in-rag-applications/

Data chunking means breaking down large amounts of information into smaller,
easier-to-handle pieces. This helps the RAG system process data more
effectively.

Imagine you’re using a customer support system to find information in a long
product manual. If the manual is divided into clear sections—like setup
instructions, troubleshooting, and warranty info—the system can quickly find and
provide the right details when a customer asks for help. Proper chunking helps
the system avoid getting overwhelmed and ensures it gives precise answers.
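
To make the idea of section-first chunking concrete, here is a minimal sketch
that splits a manual on its headings and then caps each section at a size the
model can comfortably handle. The markdown-style heading pattern and the
1,000-character cap are assumptions for illustration, not fixed rules.

# Sketch: chunk a manual by its sections (setup, troubleshooting, warranty...)
# so each retrieved piece stays on one topic. The heading regex is an assumption.

import re

def chunk_manual(manual_text, max_chars=1000):
    chunks = []
    # Split in front of markdown-style headings such as "## Troubleshooting".
    sections = re.split(r"\n(?=#{1,3}\s)", manual_text)
    for section in sections:
        body = section.strip()
        if not body:
            continue
        title = body.splitlines()[0]
        # Split oversized sections further so they fit the model's context.
        for start in range(0, len(body), max_chars):
            chunks.append({"section": title, "text": body[start:start + max_chars]})
    return chunks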

DATA CLEANING: KEEPING INFORMATION ACCURATE


Data cleaning involves fixing mistakes and removing unnecessary or confusing
information from the data. This is important because clean, accurate data helps
the RAG system give reliable responses.

For example, if a legal compliance system is analyzing contracts, and the data
includes strange symbols or outdated sections, the results might be incorrect or
confusing. Cleaning the data—by removing errors, fixing formatting issues, and
standardizing the text—makes sure the system delivers clear and correct
information.
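
A hedged sketch of those cleanup steps (standardize the text, drop stray
symbols, repair formatting) might look like the following; the regular
expressions are illustrative choices, not a canonical recipe for legal
documents.

# Sketch: normalize and clean extracted contract text before chunking and
# embedding. The allowed-character set is an illustrative assumption.

import re
import unicodedata

def clean_contract_text(raw):
    # Standardize unicode forms (smart quotes, ligatures) to a single shape.
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words hyphenated across line breaks by PDF extraction.
    text = re.sub(r"-\n(\w)", r"\1", text)
    # Drop non-printable and decorative symbols that confuse retrieval.
    text = re.sub(r"[^\w\s.,;:()%/&$-]", " ", text)
    # Collapse repeated whitespace left over from the removals above.
    text = re.sub(r"\s+", " ", text)
    return text.strip()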





--------------------------------------------------------------------------------

Thank you for reading through this blog. I hope you have learned something new
and interesting.

Given the depth and detail this topic deserves, I will be publishing Part 2,
covering:

 * Why data chunking and data normalization matter

 * Types of data chunking and data normalization techniques

 * Detailed examples





