www.databytego.com
URL:
https://www.databytego.com/p/aillm-series-building-a-smarter-data
Submission: On September 17 via api from US — Scanned from DE
DATABYTEGO - THE DATA BLOG
I write a monthly Data and database newsletter, host a podcast and just generally try to be helpful. Over 28,000 subscribers

[AI/LLM SERIES] BUILDING A SMARTER DATA PIPELINE FOR LLM & RAG: THE KEY TO ENHANCED ACCURACY AND PERFORMANCE

WANT TO GET THE MOST OUT OF YOUR LLM AND RAG SYSTEMS? IT ALL STARTS WITH A SOLID DATA PIPELINE. IN THIS GUIDE, WE’LL BREAK DOWN HOW SIMPLE TECHNIQUES LIKE CHUNKING AND CLEANING YOUR DATA CAN BOOST ACC

DataByteGo
Sep 17, 2024

KEY TAKEAWAYS
* Breaking data into chunks helps RAG systems work better. By splitting large datasets into smaller, manageable pieces, RAG systems can process information more accurately and provide relevant responses for search.
* Clean data is key to getting good results. RAG systems rely on clean, well-organized data to deliver reliable answers. Messy or inconsistent data can lead to mistakes or irrelevant information.
* Poorly chunked or unclean data hurts performance. If data isn’t properly chunked or cleaned, RAG systems might return incomplete, confusing, or inaccurate results, which can frustrate users.
* A well-structured data pipeline makes all the difference. Having a smooth process for organizing and cleaning data ensures RAG systems run efficiently, leading to better insights and overall user experience.

Today, every organization is turning to artificial intelligence (AI) not only for its products and offerings but also to boost productivity and streamline operations. According to IBM, AI is making significant strides in customer service, talent management, and application modernization. For instance, AI can handle up to 70% of contact center cases, improving the customer experience by providing faster and more accurate responses. In HR, AI-powered solutions are boosting productivity by 40%, making tasks like candidate screening and employee training more efficient. Additionally, AI is enhancing application modernization by 30%, reducing the workload in IT operations through automation, such as handling support tickets and managing incidents.

Source: https://www.ibm.com/think/topics/generative-ai-for-knowledge-management

HOW ARE ORGANIZATIONS MAKING IT HAPPEN?
One of the most impactful applications of AI is Retrieval-Augmented Generation (RAG), which combines the power of Large Language Models (LLMs) with real-time data retrieval. This combination is reshaping how businesses operate, helping them respond more quickly, make better decisions, and deliver improved customer experiences.

Tip: To learn what RAG is and how it works, see https://www.databytego.com/p/aiml-brief-introduction-to-retrieval

Some of the most common use cases of RAG (and LLMs) in organizations are:
* Customer Support: Automates responses by retrieving answers from knowledge bases for faster customer service.
* Sales and Marketing: Generates personalized marketing content using customer data and market trends, and identifies the most qualified customers based on their behavior.
* Legal and Compliance: Analyzes contracts by retrieving key clauses and providing summaries or risk assessments.
* Research and Development: Retrieves and synthesizes information from patents and scientific papers to guide innovation.
* Financial Services: Produces detailed risk assessments by analyzing financial reports and market data.

WHAT DOES A TYPICAL RAG PIPELINE LOOK LIKE?
THIS IS WHAT A TYPICAL RAG PIPELINE LOOKS LIKE IN ANY ENTERPRISE. EACH DOTTED BOX REPRESENTS EITHER A USE CASE OR THE PROJECT IN THE ORGANIZATION.
COMPANY → BU/DEPARTMENTS → USE CASES → PROJECTS → PIPELINES

1. Connect Datasource: Establishes connections with various data sources like databases, file storage, and internal apps. This enables pulling relevant data to feed the RAG model.
2. Route: Directs the data from the sources to the appropriate downstream processes. This ensures the right data reaches the correct models or systems for optimized flow.
3. Transform: Cleans, standardizes, and prepares data for use in machine learning workflows. This helps ensure consistent, clean data for accurate model results.
4. Chunk: Breaks large datasets into manageable pieces, suitable for LLM processing. This allows the model to process data without exceeding context limitations.
5. Embed: Converts text chunks into vector embeddings for semantic search and data retrieval. Embeddings make data searchable based on meaning, improving query results.
6. Persist: Stores processed data and embeddings for future queries in a vector database or data lake. This ensures fast and efficient retrieval when needed for RAG workflows.

THE PROBLEM OF CREATING A BAD DATA PIPELINE FOR RAG
On the other hand, a poorly constructed data pipeline can severely undermine the performance of RAG systems. Without a reliable pipeline, data ingestion might be inconsistent, leading to gaps or outdated information. If the data is not properly cleaned or preprocessed, it can introduce noise, biases, or errors into the system, resulting in inaccurate or irrelevant outputs. This could be particularly damaging in enterprise applications where precision and reliability are critical, such as in legal document analysis, financial forecasting, or medical research.
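
To see where such gaps or noise would enter, it can help to picture the six stages listed earlier (Connect, Route, Transform, Chunk, Embed, Persist) as code. The following is only a minimal sketch under stated assumptions: the `Record` class, every function name, and the sample texts are hypothetical; the "embedding" is a toy term-frequency vector standing in for a real embedding model; and the "vector database" is a plain Python list.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    source: str  # hypothetical label for where the text came from
    text: str
    vector: Counter = field(default_factory=Counter)


def connect() -> list[Record]:
    # Stage 1: stand-in for real connectors (databases, file storage, apps).
    return [
        Record("kb", "To update your shipping address, open   Account Settings."),
        Record("manual", "Warranty claims must be filed within 30 days of purchase."),
    ]


def route(records: list[Record], wanted: set[str]) -> list[Record]:
    # Stage 2: forward only the sources this pipeline is responsible for.
    return [r for r in records if r.source in wanted]


def transform(record: Record) -> Record:
    # Stage 3: collapse whitespace; real cleaning would do much more.
    return Record(record.source, re.sub(r"\s+", " ", record.text).strip())


def chunk(record: Record, max_words: int = 6) -> list[Record]:
    # Stage 4: fixed-size word windows, the simplest possible chunker.
    words = record.text.split()
    return [
        Record(record.source, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def embed(record: Record) -> Record:
    # Stage 5: toy term-frequency vector, NOT a real embedding model.
    tokens = re.findall(r"[a-z0-9]+", record.text.lower())
    return Record(record.source, record.text, Counter(tokens))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def run_pipeline() -> list[Record]:
    # Stage 6 (persist) is a plain list here; a vector database would replace it.
    store: list[Record] = []
    for rec in route(connect(), wanted={"kb", "manual"}):
        for piece in chunk(transform(rec)):
            store.append(embed(piece))
    return store


def search(store: list[Record], query: str) -> Record:
    # Retrieval: return the stored chunk most similar to the query.
    q = embed(Record("query", query)).vector
    return max(store, key=lambda r: cosine(q, r.vector))


store = run_pipeline()
best = search(store, "update shipping address")
print(best.source, "->", best.text)
```

Each stage is a separate function so that a weakness in any one of them is easy to locate: if `transform` lets noise through, or `chunk` cuts text at a bad boundary, the effect shows up directly in what `search` returns.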

A poorly designed data pipeline can cause problems like inconsistent data flow, wrong information, or slow responses, all of which make the RAG system less useful. If the pipeline doesn't clean or organize the data well, the system may provide confusing or incorrect answers, frustrating users and requiring more manual fixes.

EXAMPLE: CUSTOMER SUPPORT
Imagine a company using RAG to answer customer service questions. If the pipeline doesn't clean up the data (for example, by removing special characters or outdated information), the system might provide messy or irrelevant responses. For example, if a customer asks, "How do I update my shipping address?" but the system pulls outdated instructions or includes strange characters in its response, it could confuse the customer and lead to frustration. A bad pipeline can also make the system slower, causing delays in answering questions, which could reduce the trust customers have in the company’s support service.

EXAMPLE: RESEARCH AND DEVELOPMENT
In a research and development team, a RAG system might be used to summarize scientific papers and patents. If the data pipeline is poorly designed, it could fail to segment the text into meaningful chunks or to remove irrelevant formatting. For instance, if the system tries to summarize a paper but the data is jumbled with incorrect citations or non-standard characters, the summaries might be misleading or incomplete. This can hinder the research process, slow down innovation, and lead to incorrect conclusions.

Learn more about RAG use cases here: https://hyperight.com/7-practical-applications-of-rag-models-and-their-impact-on-society/

WHY DATA CHUNKING AND DATA CLEANING ARE CRUCIAL FOR RAG SYSTEMS
When using Retrieval-Augmented Generation (RAG) technology, having high-quality data is essential for getting accurate and helpful results. Two key processes that help ensure this quality are data chunking and data cleaning.
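
As a rough illustration of these two processes, the sketch below cleans a snippet of support text and then splits it into overlapping word-window chunks. The function names `clean_text` and `chunk_words`, the window size, the overlap, and the sample string are all illustrative assumptions rather than anything this post prescribes, and the tag-stripping regex is a deliberate simplification that would also remove legitimate text between angle brackets.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control (Cc) and format (Cf) characters, keeping ordinary whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Strip leftover HTML-style tags, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    # Slide a window of `size` words, stepping by size - overlap so that
    # neighbouring chunks share some context across the boundary.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical raw input: stray tags, a zero-width space, a control
# character, and uneven spacing.
raw = "<p>To update your\u200b shipping   address,\x07 open Account Settings and choose Addresses.</p>"
cleaned = clean_text(raw)
for piece in chunk_words(cleaned):
    print(repr(piece))
```

The overlap is a design choice: repeating a couple of words at each boundary reduces the chance that a sentence split across two chunks loses its meaning in both. Chunking by document section, as in the product manual example, is an alternative when the source has clear headings.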

DATA CHUNKING: MAKING INFORMATION MANAGEABLE
Data chunking means breaking down large amounts of information into smaller, easier-to-handle pieces. This helps the RAG system process data more effectively. Imagine you’re using a customer support system to find information in a long product manual. If the manual is divided into clear sections, such as setup instructions, troubleshooting, and warranty info, the system can quickly find and provide the right details when a customer asks for help. Proper chunking helps the system avoid getting overwhelmed and ensures it gives precise answers.

Learn more: Breaking up is hard to do: Chunking in RAG applications https://stackoverflow.blog/2024/06/06/breaking-up-is-hard-to-do-chunking-in-rag-applications/

DATA CLEANING: KEEPING INFORMATION ACCURATE
Data cleaning involves fixing mistakes and removing unnecessary or confusing information from the data. This is important because clean, accurate data helps the RAG system give reliable responses. For example, if a legal compliance system is analyzing contracts, and the data includes strange symbols or outdated sections, the results might be incorrect or confusing. Cleaning the data by removing errors, fixing formatting issues, and standardizing the text makes sure the system delivers clear and correct information.

--------------------------------------------------------------------------------

Thank you for reading through this blog. I hope you have learned something new and interesting. Given the detail and depth of this topic, I will be publishing a part 2 covering:
* Why data chunking and data normalization matter.
* Types of data chunking and data normalization techniques.
* Detailed examples.

© 2024 Ajay Patel