FINE-TUNING EMBEDDINGS FOR SPECIFIC DOMAINS: A COMPREHENSIVE GUIDE

kirouane Ayoub · Published in GoPenAI · 11 min read · Sep 30, 2024

Imagine you’re building a question answering system for a medical domain. You
want to ensure it can accurately retrieve relevant medical articles when a user
asks a question. But generic embedding models might struggle with the highly
specialized vocabulary and nuances of medical terminology.

That’s where fine-tuning comes in!

In this blog post, we’ll delve into the process of fine-tuning an embedding
model for a specific domain, like medicine, law, or finance. We’ll generate a
dataset specifically for your domain and use it to train the model to better
understand the subtle language patterns and concepts within your chosen field.

> By the end, you’ll have a more powerful embedding model that’s optimized for
> your domain, enabling more accurate retrieval and improved results for your
> NLP tasks.


EMBEDDINGS: UNDERSTANDING THE CONCEPT



Embeddings are powerful numerical representations of text (or images) that
capture semantic relationships. Imagine each piece of text as a point in a
multi-dimensional space, where similar words or phrases sit closer together
than dissimilar ones.



Embeddings are essential for many NLP tasks, such as:

 * Semantic Similarity: Measuring how similar two pieces of text (or two images) are.
 * Text Classification: Grouping data into categories based on meaning.
 * Question Answering: Finding the most relevant document to answer a question.
 * Retrieval Augmented Generation (RAG): Combining an embedding model for retrieval
   with a language model for text generation to improve the quality and relevance of
   the generated text.
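
As a quick illustration of the semantic-similarity use case, here is a minimal
sketch using the sentence-transformers library (the all-MiniLM-L6-v2 model is
just an arbitrary general-purpose example, not the model we fine-tune later):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any general-purpose embedding model works for this illustration
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The patient was prescribed antibiotics for the infection.",
    "The doctor gave the patient medication to treat the infection.",
    "The stock market closed higher today.",
]
embeddings = model.encode(sentences)

# Related sentences get a higher cosine similarity than unrelated ones
print(cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(cos_sim(embeddings[0], embeddings[2]))  # relatively low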


MATRYOSHKA REPRESENTATION LEARNING



Matryoshka Representation Learning (MRL) is a technique for creating
“truncatable” embedding vectors. Imagine a series of nested dolls, with each
doll containing a smaller one inside. MRL embeds text in a way that the earlier
dimensions (like the outer dolls) contain the most important information, and
subsequent dimensions add detail. This allows you to use only a portion of the
embedding vector when needed, reducing storage and computation costs.
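
As a rough sketch of the idea (using the bge-base-en-v1.5 model we fine-tune
later; note that the base model is not trained with MRL, which is exactly what
the Matryoshka fine-tuning below addresses):

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dimensional embeddings

full = model.encode(
    ["What are the side effects of aspirin?",
     "Aspirin can cause stomach irritation and bleeding."],
    convert_to_tensor=True,
)

# Keep only the first 256 dimensions and re-normalize before comparing
truncated = torch.nn.functional.normalize(full[:, :256], p=2, dim=1)

print(cos_sim(full[0], full[1]))            # similarity with all 768 dimensions
print(cos_sim(truncated[0], truncated[1]))  # similarity with only 256 dimensions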




BGE-BASE-EN

The BAAI/bge-base-en-v1.5 model, developed by BAAI (Beijing Academy of
Artificial Intelligence), is a powerful text embedding model. It excels at
various NLP tasks and has been shown to perform well on benchmarks like MTEB and
C-MTEB. The bge-base-en model is a good choice for applications with limited
computing resources (as in my case).


WHY FINE-TUNE EMBEDDINGS?

Fine-tuning an embedding model for a specific domain is crucial for optimizing
RAG systems. This process ensures that the model’s understanding of similarity
aligns with the specific context and language nuances of your domain. A
fine-tuned embedding model is better equipped to retrieve the most relevant
documents for a question, ultimately leading to more accurate and relevant
responses from your RAG system.


DATASET FORMATS: BUILDING THE FOUNDATION FOR FINE-TUNING

You can use various dataset formats for fine-tuning.

Here are the most common types:

 * Positive Pair: A pair of related sentences (e.g., a question and its answer).
 * Triplets: (anchor, positive, negative) triplets, where the anchor is similar
   to the positive and dissimilar to the negative.
 * Pair with Similarity Score: A pair of sentences with a similarity score
   indicating their relationship.
 * Texts with Classes: A text with its corresponding class label.

In this blog post, we will create a dataset of question-answer pairs to
fine-tune our bge-base-en-v1.5 model.
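
For orientation, here is roughly what a single row looks like in each format
(the field names are only illustrative; the libraries accept several column
layouts):

# Positive pair (what we build in this post)
{"anchor": "What are the side effects of aspirin?",
 "positive": "Aspirin can cause stomach irritation and bleeding."}

# Triplet
{"anchor": "...", "positive": "...", "negative": "..."}

# Pair with similarity score
{"sentence1": "...", "sentence2": "...", "score": 0.83}

# Text with class label
{"text": "...", "label": 2}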


LOSS FUNCTIONS: GUIDING THE TRAINING PROCESS

Loss functions are crucial for training embedding models. They measure the
discrepancy between the model’s predictions and the actual labels, providing a
signal for the model to adjust its weights.

Different loss functions are suitable for different dataset formats:

 * Triplet Loss: Used with (anchor, positive, negative) triplets to encourage
   the model to place similar sentences closer together and dissimilar sentences
   farther apart.
 * Contrastive Loss: Used with positive and negative pairs, encouraging similar
   sentences to be close and dissimilar sentences to be distant.
 * Cosine Similarity Loss: Used with pairs of sentences and a similarity score,
   encouraging the model to produce embeddings with cosine similarities that
   match the provided scores.
 * Matryoshka Loss: A specialized loss function designed to create Matryoshka
   embeddings, where the embeddings are truncatable.
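
In the sentence-transformers library these correspond to concrete loss
classes. A quick sketch of how each would be instantiated (we only use the
last two in this post):

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

triplet_loss = losses.TripletLoss(model)              # (anchor, positive, negative) triplets
contrastive_loss = losses.ContrastiveLoss(model)      # pairs with a binary label
cosine_loss = losses.CosineSimilarityLoss(model)      # pairs with a similarity score
matryoshka_loss = losses.MatryoshkaLoss(              # wraps another loss to make embeddings truncatable
    model, losses.MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)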


CODE EXAMPLE


INSTALLING DEPENDENCIES

We start by installing the essential libraries. We’ll use datasets,
sentence-transformers, and google-generativeai for handling datasets, embedding
models, and text generation, respectively.

apt-get -qq install poppler-utils tesseract-ocr
pip install datasets sentence-transformers google-generativeai
pip install -q --user --upgrade pillow
pip install -q unstructured["all-docs"] pi_heif
pip install -q --upgrade unstructured
pip install --upgrade nltk

We’ll also install unstructured for PDF parsing and nltk for text processing.


PDF PARSING AND TEXT EXTRACTION

We’ll use the unstructured library to extract text and tables from PDF files.

import nltk
import os 
from unstructured.partition.pdf import partition_pdf
from collections import Counter
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab') 

def process_pdfs_in_folder(folder_path):
    total_text = []  # To accumulate the text from all PDFs

    # Get list of all PDF files in the folder
    pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_path = os.path.join(folder_path, pdf_file)
        print(f"Processing: {pdf_path}")

        # Apply the partition logic
        elements = partition_pdf(pdf_path, strategy="auto")

        # Show a count of the element types found (display() works in notebooks; use print() in plain scripts)
        display(Counter(type(element) for element in elements))

        # Join the elements to form text and add it to total_text list
        text = "\n\n".join([str(el) for el in elements])
        total_text.append(text)

    # Return the total concatenated text
    return "\n\n".join(total_text)


folder_path = "data"
all_text = process_pdfs_in_folder(folder_path)

We go through each PDF in the specified folder and partition the content into
elements such as text, tables, and figures.

We then join all extracted elements into a single text string.
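
If you want to keep tables separate rather than folding everything into one
string, a minimal sketch using unstructured’s element classes might look like
this (the file path is just a placeholder):

from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Table

# Partition one PDF and separate table elements from the rest
elements = partition_pdf("data/example.pdf", strategy="auto")
tables = [el for el in elements if isinstance(el, Table)]
texts = [str(el) for el in elements if not isinstance(el, Table)]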


CUSTOM TEXT CHUNKING

We now break the extracted text into manageable chunks using nltk. This makes
the text more suitable for processing by the LLM and for generating focused QA pairs.

import nltk

nltk.download('punkt')

def nltk_based_splitter(text: str, chunk_size: int, overlap: int) -> list:
    """
    Splits the input text into chunks of a specified size, with optional overlap between chunks.

    Parameters:
    - text: The input text to be split.
    - chunk_size: The maximum size of each chunk (in terms of characters).
    - overlap: The number of overlapping characters between consecutive chunks.

    Returns:
    - A list of text chunks, with or without overlap.
    """

    from nltk.tokenize import sent_tokenize

    # Tokenize the input text into individual sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If the current chunk plus the next sentence doesn't exceed the chunk size, add the sentence to the chunk
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            # Otherwise, add the current chunk to the list of chunks and start a new chunk with the current sentence
            chunks.append(current_chunk.strip())  # Strip to remove leading spaces
            current_chunk = sentence

    # After the loop, if there is any leftover text in the current chunk, add it to the list of chunks
    if current_chunk:
        chunks.append(current_chunk.strip())

    # Handle overlap if it's specified (overlap > 0)
    if overlap > 0:
        overlapping_chunks = []
        for i in range(len(chunks)):
            if i > 0:
                # Calculate the start index for overlap from the previous chunk
                start_overlap = max(0, len(chunks[i-1]) - overlap)
                # Combine the overlapping portion of the previous chunk with the current chunk
                chunk_with_overlap = chunks[i-1][start_overlap:] + " " + chunks[i]
                # Append the combined chunk, making sure it's not longer than chunk_size
                overlapping_chunks.append(chunk_with_overlap[:chunk_size])
            else:
                # For the first chunk, there's no previous chunk to overlap with
                overlapping_chunks.append(chunks[i][:chunk_size])

        return overlapping_chunks  # Return the list of chunks with overlap

    # If overlap is 0, return the non-overlapping chunks
    return chunks

chunks = nltk_based_splitter(text=all_text,
                             chunk_size=2048,
                             overlap=0)


DATASET GENERATOR

In this section we define two functions:

The prompt function creates a prompt for Google Gemini, requesting a
Question-Answer pair based on a provided text chunk.

import google.generativeai as genai
import pandas as pd

# Replace with your valid Google API key
GOOGLE_API_KEY = "xxxxxxxxxxxx"

# Prompt generator with an explicit request for structured output
def prompt(text_chunk):
    return f"""
    Based on the following text, generate one Question and its corresponding Answer.
    Please format the output as follows:
    Question: [Your question]
    Answer: [Your answer]

    Text: {text_chunk}
    """
# Function to interact with Google's Gemini and return a QA pair
def generate_with_gemini(text_chunk:str, temperature:float, model_name:str):
    genai.configure(api_key=GOOGLE_API_KEY)
    generation_config = {"temperature": temperature}

    # Initialize the generative model
    gen_model = genai.GenerativeModel(model_name, generation_config=generation_config)

    # Generate response based on the prompt
    response = gen_model.generate_content(prompt(text_chunk))

    # Extract question and answer from response using keyword
    try:
        question, answer = response.text.split("Answer:", 1)
        question = question.replace("Question:", "").strip()
        answer = answer.strip()
    except ValueError:
        question, answer = "N/A", "N/A"  # Handle unexpected format in response

    return question, answer

The generate_with_gemini function interacts with the Gemini model and generates
a QA pair using the created prompt.
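
Before running the full pipeline, it is worth sanity-checking the output on a
single chunk:

# Quick sanity check on one chunk before processing everything
sample_question, sample_answer = generate_with_gemini(
    chunks[0], temperature=0.7, model_name="gemini-1.5-flash"
)
print("Q:", sample_question)
print("A:", sample_answer)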


RUNNING Q&A GENERATION

Using the process_text_chunks function, we generate QA pairs for each text chunk
using the Gemini model.

def process_text_chunks(text_chunks: list, temperature: float, model_name: str):
    """
    Processes a list of text chunks to generate questions and answers using a specified model.

    Parameters:
    - text_chunks: A list of text chunks to process.
    - temperature: The sampling temperature to control randomness in the generated outputs.
    - model_name: The name of the model to use for generating questions and answers.

    Returns:
    - A Pandas DataFrame containing the text chunks, questions, and answers.
    """
    results = []

    # Iterate through each text chunk
    for chunk in text_chunks:
        question, answer = generate_with_gemini(chunk, temperature, model_name)
        results.append({"Text Chunk": chunk, "Question": question, "Answer": answer})

    # Convert results into a Pandas DataFrame
    df = pd.DataFrame(results)
    return df
# Process the text chunks and get the DataFrame
df_results = process_text_chunks(text_chunks=chunks, 
                                 temperature=0.7, 
                                 model_name="gemini-1.5-flash")
df_results.to_csv("generated_qa_pairs.csv", index=False)

These results are then stored in a Pandas DataFrame.


LOADING THE DATASET

Next, we load the generated QA pairs from the CSV file into a HuggingFace
dataset. We make sure the data is in the correct format for fine-tuning.

from datasets import load_dataset

# Load the CSV file into a Hugging Face Dataset
dataset = load_dataset('csv', data_files='generated_qa_pairs.csv')

def process_example(example, idx):
    return {
        "id": idx,  # Add unique ID based on the index
        "anchor": example["Question"],
        "positive": example["Answer"]
    }
dataset = dataset.map(process_example,
                      with_indices=True,
                      remove_columns=["Text Chunk", "Question", "Answer"])
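
It’s worth printing a sample to confirm the columns look as expected before
training:

# Quick sanity check of the processed dataset
print(dataset)               # DatasetDict with a single "train" split
print(dataset["train"][0])   # e.g. {"id": 0, "anchor": "...", "positive": "..."}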


LOADING THE MODEL

We load the BAAI/bge-base-en-v1.5 model from HuggingFace, making sure to choose
the appropriate device for execution (CPU or GPU).

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss


model_id = "BAAI/bge-base-en-v1.5" 

# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)


DEFINING THE LOSS FUNCTION

Here, we configure the Matryoshka loss function, specifying the dimensions to be
used for the truncated embeddings.

# Important: large to small
matryoshka_dimensions = [768, 512, 256, 128, 64] 
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

The inner loss function, MultipleNegativesRankingLoss, helps the model produce
embeddings suitable for retrieval tasks.


DEFINING TRAINING ARGUMENTS

We use SentenceTransformerTrainingArguments to define the training parameters.
This includes the output directory, number of epochs, batch size, learning rate,
and evaluation strategy.

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-finetuned",                 # output directory and hugging face model ID
    num_train_epochs=1,                         # number of epochs
    per_device_train_batch_size=4,              # train batch size
    gradient_accumulation_steps=16,             # effective batch size of 64 (4 x 16)
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=True,                                  # use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)

NOTE : If you’re working on a Tesla T4 and encounter errors during training, try
commenting out the lines tf32=True and bf16=True to disable TF32 and BF16
precision.


CREATING THE EVALUATOR

We create an evaluator to measure the model’s performance during training. The
evaluator assesses the model’s retrieval performance using
InformationRetrievalEvaluator for each dimension in the Matryoshka loss.

corpus = dict(
    zip(dataset['train']['id'], 
        dataset['train']['positive'])
)  # Our corpus (cid => document)

queries = dict(
    zip(dataset['train']['id'], 
        dataset['train']['anchor'])
)  # Our queries (qid => question)

# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for q_id in queries:
    relevant_docs[q_id] = [q_id]

matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)

# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)


EVALUATING THE MODEL BEFORE FINE-TUNING

We evaluate the base model to get a baseline performance before fine-tuning.

results = evaluator(model)

for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")


DEFINING THE TRAINER

We create a SentenceTransformerTrainer object, specifying the model, training
arguments, dataset, loss function, and evaluator.

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model, # our embedding model
    args=args,  # training arguments we defined above
    train_dataset=dataset["train"].select_columns(
        ["positive", "anchor"]
    ),  # keep only the columns the loss expects
    loss=train_loss, # Matryoshka loss
    evaluator=evaluator, # Sequential Evaluator
)


STARTING FINE-TUNING

The trainer.train() method starts the fine-tuning process, updating the model's
weights using the provided data and loss function.

# start training 
trainer.train()
# save the best model
trainer.save_model()

Once training is done, we save the best-performing model to the specified output
directory.
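
If you want to share the result, the fine-tuned model can also be pushed to
the Hugging Face Hub; a minimal sketch (the repository ID is a placeholder,
and you need to be logged in via huggingface-cli login):

# Optional: publish the fine-tuned model (placeholder repo ID)
trainer.model.push_to_hub("your-username/bge-finetuned")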


EVALUATING AFTER FINE-TUNING

Finally, we load the fine-tuned model and evaluate it using the same evaluator
to measure the improvement in performance after fine-tuning.

from sentence_transformers import SentenceTransformer

fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
results = evaluator(fine_tuned_model)

# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")
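
From here, the fine-tuned model can be dropped straight into a retrieval
pipeline. A minimal sketch (the query and documents are placeholders),
truncating the embeddings to 256 dimensions, which the Matryoshka loss makes
safe to do:

from sentence_transformers.util import cos_sim

# Encode a query and candidate documents, keeping only the first 256 dimensions
query_emb = fine_tuned_model.encode(
    "What are the side effects of aspirin?", convert_to_tensor=True
)[:256]
doc_embs = fine_tuned_model.encode(
    ["Aspirin can cause stomach irritation and bleeding.",
     "The hospital cafeteria opens at 8 am."],
    convert_to_tensor=True
)[:, :256]

print(cos_sim(query_emb, doc_embs))  # the relevant document should score higher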



By fine-tuning an embedding model for your domain, you equip your NLP
applications with a deeper understanding of the specific language and concepts
within that field. This can lead to significant improvements in tasks like
question answering, document retrieval, and text generation.

The techniques discussed in this blog post, such as leveraging MRL and using a
powerful model like bge-base-en, offer a practical path towards building
domain-specific embedding models. While we've focused on the process of
fine-tuning, remember that the quality of your dataset is equally crucial:
carefully curating a dataset that accurately reflects the nuances of your domain
is essential for achieving optimal results.

As the field of NLP continues to advance, we can expect to see even more
powerful embedding models and fine-tuning strategies emerge. By staying informed
and adapting your approach, you can harness the full potential of embedding
models for building high-quality NLP applications tailored to your specific needs.

Happy tuning!



My LinkedIn : https://www.linkedin.com/in/ayoub-kirouane3

My HuggingFace : https://huggingface.co/ayoubkirouane



