blog.gopenai.com
Open in
urlscan Pro
162.159.153.4
Public Scan
Submitted URL: https://blog.gopenai.com/fine-tuning-embeddings-for-specific-domains-a-comprehensive-guide-5e4298b42185?source=email-547b...
Effective URL: https://blog.gopenai.com/fine-tuning-embeddings-for-specific-domains-a-comprehensive-guide-5e4298b42185?gi=a53a20f1a6cd&s...
Submission: On November 01 via api from US — Scanned from DE
Effective URL: https://blog.gopenai.com/fine-tuning-embeddings-for-specific-domains-a-comprehensive-guide-5e4298b42185?gi=a53a20f1a6cd&s...
Submission: On November 01 via api from US — Scanned from DE
Form analysis
0 forms found in the DOMText Content
Open in app Sign up Sign in Write Sign up Sign in Top highlight FINE-TUNING EMBEDDINGS FOR SPECIFIC DOMAINS: A COMPREHENSIVE GUIDE kirouane Ayoub · Follow Published in GoPenAI · 11 min read · Sep 30, 2024 556 2 Listen Share Imagine you’re building a question answering system for a medical domain. You want to ensure it can accurately retrieve relevant medical articles when a user asks a question. But generic embedding models might struggle with the highly specialized vocabulary and nuances of medical terminology. That’s where fine-tuning comes in !! In this blog post, we’ll delve into the process of fine-tuning an embedding model for a specific domain, like medicine, law, or finance. We’ll generate a dataset specifically for your domain and use it to train the model to better understand the subtle language patterns and concepts within your chosen field. > By the end, you’ll have a more powerful embedding model that’s optimized for > your domain, enabling more accurate retrieval and improved results for your > NLP tasks. EMBEDDINGS: UNDERSTANDING THE CONCEPT Embeddings are powerful numerical representations of text or image that capture semantic relationships. Imagine a text or audio as a point in a multi-dimensional space, where similar words or phrases are located closer together than dissimilar ones. Embeddings are essential for many NLP tasks like : Semantic Similarity: Finding how similar two pieces of images or text are. Text Classification: Grouping your data into categories based on their meaning. Question Answering: Finding the most relevant document to answer a question. Retrieval Augmented Generation (RAG): Combining an embedding model for retrieval and a language model for text generation to improve the quality and relevance of generated text. MATRYOSHKA REPRESENTATION LEARNING Matryoshka Representation Learning (MRL) is a technique for creating “truncatable” embedding vectors. Imagine a series of nested dolls, with each doll containing a smaller one inside. MRL embeds text in a way that the earlier dimensions (like the outer dolls) contain the most important information, and subsequent dimensions add detail. This allows you to use only a portion of the embedding vector when needed, reducing storage and computation costs. BGE-BASE-EN The BAAI/bge-base-en-v1.5 model, developed by BAAI (Beijing Academy of Artificial Intelligence), is a powerful text embedding model. It excels at various NLP tasks and has been shown to perform well on benchmarks like MTEB and C-MTEB. The bge-base-en model is a good choice for applications with limited computing resources (like my case). WHY FINE-TUNE EMBEDDINGS ? Fine-tuning an embedding model for a specific domain is crucial for optimizing RAG systems. This process ensures that the model’s understanding of similarity aligns with the specific context and language nuances of your domain. A fine-tuned embedding model is better equipped to retrieve the most relevant documents for a question, ultimately leading to more accurate and relevant responses from your RAG system. DATASET FORMATS: BUILDING THE FOUNDATION FOR FINE-TUNING You can use various dataset formats for fine-tuning. Here are the most common types: * Positive Pair: A pair of related sentences (e.g.,questions , answers) . * Triplets: (anchor, positive, negative) triplets, where the anchor is similar to the positive and dissimilar to the negative. * Pair with Similarity Score: A pair of sentences with a similarity score indicating their relationship. * Texts with Classes: A text with its corresponding class label. In this blog post, we will create a dataset of questions , answers pairs to fine-tune our bge-base-en-v1.5 model. LOSS FUNCTIONS: GUIDING THE TRAINING PROCESS Loss functions are crucial for training embedding models. They measure the discrepancy between the model’s predictions and the actual labels, providing a signal for the model to adjust its weights. Different loss functions are suitable for different dataset formats: * Triplet Loss: Used with (anchor, positive, negative) triplets to encourage the model to place similar sentences closer together and dissimilar sentences farther apart. * Contrastive Loss: Used with positive and negative pairs, encouraging similar sentences to be close and dissimilar sentences to be distant. * Cosine Similarity Loss: Used with pairs of sentences and a similarity score, encouraging the model to produce embeddings with cosine similarities that match the provided scores. * Matryoshka Loss: A specialized loss function designed to create Matryoshka embeddings, where the embeddings are truncatable. CODE EXAMPLE INSTALLING DEPENDENCIES We start with installing essential libraries. We’ll use datasets, sentence-transformers, and google-generativeai for handling datasets, embedding models, and text generation. apt-get -qq install poppler-utils tesseract-ocr pip install datasets sentence-transformers google-generativeai pip install -q --user --upgrade pillow pip install -q unstructured["all-docs"] pi_heif pip install -q --upgrade unstructured pip install --upgrade nltk We’ll also install unstructured for PDF parsing and nltk for text processing. PDF PARSING AND TEXT EXTRACTION We’ll use the unstructured library to extract text and tables from PDF files. import nltk import os from unstructured.partition.pdf import partition_pdf from collections import Counter nltk.download('punkt') nltk.download('averaged_perceptron_tagger') nltk.download('punkt_tab') def process_pdfs_in_folder(folder_path): total_text = [] # To accumulate the text from all PDFs # Get list of all PDF files in the folder pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')] for pdf_file in pdf_files: pdf_path = os.path.join(folder_path, pdf_file) print(f"Processing: {pdf_path}") # Apply the partition logic elements = partition_pdf(pdf_path, strategy="auto") # Display the types of elements display(Counter(type(element) for element in elements)) # Join the elements to form text and add it to total_text list text = "\n\n".join([str(el) for el in elements]) total_text.append(text) # Return the total concatenated text return "\n\n".join(total_text) folder_path = "data" all_text = process_pdfs_in_folder(folder_path) We go through each PDF in a specified folder and partition the content into text, tables, and figures. We then combine the text elements into a single text representation. CUSTOM TEXT CHUNKING we break now the extracted text into manageable chunks using nltk. This is essential for making the text more suitable for processing by the llm. import nltk nltk.download('punkt') def nltk_based_splitter(text: str, chunk_size: int, overlap: int) -> list: """ Splits the input text into chunks of a specified size, with optional overlap between chunks. Parameters: - text: The input text to be split. - chunk_size: The maximum size of each chunk (in terms of characters). - overlap: The number of overlapping characters between consecutive chunks. Returns: - A list of text chunks, with or without overlap. """ from nltk.tokenize import sent_tokenize # Tokenize the input text into individual sentences sentences = sent_tokenize(text) chunks = [] current_chunk = "" for sentence in sentences: # If the current chunk plus the next sentence doesn't exceed the chunk size, add the sentence to the chunk if len(current_chunk) + len(sentence) <= chunk_size: current_chunk += " " + sentence else: # Otherwise, add the current chunk to the list of chunks and start a new chunk with the current sentence chunks.append(current_chunk.strip()) # Strip to remove leading spaces current_chunk = sentence # After the loop, if there is any leftover text in the current chunk, add it to the list of chunks if current_chunk: chunks.append(current_chunk.strip()) # Handle overlap if it's specified (overlap > 0) if overlap > 0: overlapping_chunks = [] for i in range(len(chunks)): if i > 0: # Calculate the start index for overlap from the previous chunk start_overlap = max(0, len(chunks[i-1]) - overlap) # Combine the overlapping portion of the previous chunk with the current chunk chunk_with_overlap = chunks[i-1][start_overlap:] + " " + chunks[i] # Append the combined chunk, making sure it's not longer than chunk_size overlapping_chunks.append(chunk_with_overlap[:chunk_size]) else: # For the first chunk, there's no previous chunk to overlap with overlapping_chunks.append(chunks[i][:chunk_size]) return overlapping_chunks # Return the list of chunks with overlap # If overlap is 0, return the non-overlapping chunks return chunks chunks = nltk_based_splitter(text=all_text, chunk_size=2048, overlap=0) DATASET GENERATOR In this section we define two functions: The prompt function creates a prompt for Google Gemini, requesting a Question-Answer pair based on a provided text chunk. import google.generativeai as genai import pandas as pd # Replace with your valid Google API key GOOGLE_API_KEY = "xxxxxxxxxxxx" # Prompt generator with an explicit request for structured output def prompt(text_chunk): return f""" Based on the following text, generate one Question and its corresponding Answer. Please format the output as follows: Question: [Your question] Answer: [Your answer] Text: {text_chunk} """ # Function to interact with Google's Gemini and return a QA pair def generate_with_gemini(text_chunk:str, temperature:float, model_name:str): genai.configure(api_key=GOOGLE_API_KEY) generation_config = {"temperature": temperature} # Initialize the generative model gen_model = genai.GenerativeModel(model_name, generation_config=generation_config) # Generate response based on the prompt response = gen_model.generate_content(prompt(text_chunk)) # Extract question and answer from response using keyword try: question, answer = response.text.split("Answer:", 1) question = question.replace("Question:", "").strip() answer = answer.strip() except ValueError: question, answer = "N/A", "N/A" # Handle unexpected format in response return question, answer The generate_with_gemini function interacts with the Gemini model and generates a QA pair using the created prompt. RUNNING Q&A GENERATION Using the process_text_chunks function, we generate QA pairs for each text chunk using the Gemini model. def process_text_chunks(text_chunks:list, temperature:int, model_name=str): """ Processes a list of text chunks to generate questions and answers using a specified model. Parameters: - text_chunks: A list of text chunks to process. - temperature: The sampling temperature to control randomness in the generated outputs. - model_name: The name of the model to use for generating questions and answers. Returns: - A Pandas DataFrame containing the text chunks, questions, and answers. """ results = [] # Iterate through each text chunk for chunk in text_chunks: question, answer = generate_with_gemini(chunk, temperature, model_name) results.append({"Text Chunk": chunk, "Question": question, "Answer": answer}) # Convert results into a Pandas DataFrame df = pd.DataFrame(results) return df # Process the text chunks and get the DataFrame df_results = process_text_chunks(text_chunks=chunks, temperature=0.7, model_name="gemini-1.5-flash") df_results.to_csv("generated_qa_pairs.csv", index=False) These results are then stored in a Pandas DataFrame. LOADING THE DATASET Next, we load the generated QA pairs from the CSV file into a HuggingFace dataset. We make sure the data is in the correct format for fine-tuning. from datasets import load_dataset # Load the CSV file into a Hugging Face Dataset dataset = load_dataset('csv', data_files='generated_qa_pairs.csv') def process_example(example, idx): return { "id": idx, # Add unique ID based on the index "anchor": example["Question"], "positive": example["Answer"] } dataset = dataset.map(process_example, with_indices=True , remove_columns=["Text Chunk", "Question", "Answer"]) LOADING THE MODEL We load the BAAI/bge-base-en-v1.5 model from HuggingFace, making sure to choose the appropriate device for execution (CPU or GPU). import torch from sentence_transformers import SentenceTransformer from sentence_transformers.evaluation import ( InformationRetrievalEvaluator, SequentialEvaluator, ) from sentence_transformers.util import cos_sim from datasets import load_dataset, concatenate_datasets from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss model_id = "BAAI/bge-base-en-v1.5" # Load a model model = SentenceTransformer( model_id, device="cuda" if torch.cuda.is_available() else "cpu" ) DEFINING THE LOSS FUNCTION Here, we configure the Matryoshka loss function, specifying the dimensions to be used for the truncated embeddings. # Important: large to small matryoshka_dimensions = [768, 512, 256, 128, 64] inner_train_loss = MultipleNegativesRankingLoss(model) train_loss = MatryoshkaLoss( model, inner_train_loss, matryoshka_dims=matryoshka_dimensions ) The inner loss function, MultipleNegativesRankingLoss, helps the model produce embeddings suitable for retrieval tasks. DEFINING TRAINING ARGUMENTS We use SentenceTransformerTrainingArguments to define the training parameters. This includes the output directory, number of epochs, batch size, learning rate, and evaluation strategy. from sentence_transformers import SentenceTransformerTrainingArguments from sentence_transformers.training_args import BatchSamplers # define training arguments args = SentenceTransformerTrainingArguments( output_dir="bge-finetuned", # output directory and hugging face model ID num_train_epochs=1, # number of epochs per_device_train_batch_size=4, # train batch size gradient_accumulation_steps=16, # for a global batch size of 512 per_device_eval_batch_size=16, # evaluation batch size warmup_ratio=0.1, # warmup ratio learning_rate=2e-5, # learning rate, 2e-5 is a good value lr_scheduler_type="cosine", # use constant learning rate scheduler optim="adamw_torch_fused", # use fused adamw optimizer tf32=True, # use tf32 precision bf16=True, # use bf16 precision batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch eval_strategy="epoch", # evaluate after each epoch save_strategy="epoch", # save after each epoch logging_steps=10, # log every 10 steps save_total_limit=3, # save only the last 3 models load_best_model_at_end=True, # load the best model when training ends metric_for_best_model="eval_dim_128_cosine_ndcg@10", # Optimizing for the best ndcg@10 score for the 128 dimension ) NOTE : If you’re working on a Tesla T4 and encounter errors during training, try commenting out the lines tf32=True and bf16=True to disable TF32 and BF16 precision. CREATING THE EVALUATOR We create an evaluator to measure the model’s performance during training. The evaluator assesses the model’s retrieval performance using InformationRetrievalEvaluator for each dimension in the Matryoshka loss. corpus = dict( zip(dataset['train']['id'], dataset['train']['positive']) ) # Our corpus (cid => document) queries = dict( zip(dataset['train']['id'], dataset['train']['anchor']) ) # Our queries (qid => question) # Create a mapping of relevant document (1 in our case) for each query relevant_docs = {} # Query ID to relevant documents (qid => set([relevant_cids]) for q_id in queries: relevant_docs[q_id] = [q_id] matryoshka_evaluators = [] # Iterate over the different dimensions for dim in matryoshka_dimensions: ir_evaluator = InformationRetrievalEvaluator( queries=queries, corpus=corpus, relevant_docs=relevant_docs, name=f"dim_{dim}", truncate_dim=dim, # Truncate the embeddings to a certain dimension score_functions={"cosine": cos_sim}, ) matryoshka_evaluators.append(ir_evaluator) # Create a sequential evaluator evaluator = SequentialEvaluator(matryoshka_evaluators) EVALUATING THE MODEL BEFORE FINE-TUNING We evaluate the base model to get a baseline performance before fine-tuning. results = evaluator(model) for dim in matryoshka_dimensions: key = f"dim_{dim}_cosine_ndcg@10" print(f"{key}: {results[key]}") DEFINING THE TRAINER We create a SentenceTransformerTrainer object, specifying the model, training arguments, dataset, loss function, and evaluator. from sentence_transformers import SentenceTransformerTrainer trainer = SentenceTransformerTrainer( model=model, # our embedding model args=args, # training arguments we defined above train_dataset=dataset.select_columns( ["positive", "anchor"] ), loss=train_loss, # Matryoshka loss evaluator=evaluator, # Sequential Evaluator ) STARTING FINE-TUNING The trainer.train() method starts the fine-tuning process, updating the model's weights using the provided data and loss function. # start training trainer.train() # save the best model trainer.save_model() Once training is done, we save the best-performing model to the specified output directory. EVALUATING AFTER FINE-TUNING Finally, we load the fine-tuned model and evaluate it using the same evaluator to measure the improvement in performance after fine-tuning. from sentence_transformers import SentenceTransformer fine_tuned_model = SentenceTransformer( args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu" ) # Evaluate the model results = evaluator(fine_tuned_model) # Print the main score for dim in matryoshka_dimensions: key = f"dim_{dim}_cosine_ndcg@10" print(f"{key}: {results[key]}") By fine-tuning an embedding model for your domain, you equip your nlp applications with a deeper understanding of the specific language and concepts within that field, this can lead to significant improvements in tasks like question answering, document retrieval, and text generation. The techniques discussed in this blog post, such as leveraging mrl and using a powerful model like bge-base-en, offer a practical path towards building domain-specific embedding models. while we've focused on the process of fine-tuning, remember that the quality of your dataset is equally crucial, carefully curating a dataset that accurately reflects the nuances of your domain is essential for achieving optimal results. As the field of nlp continues to advance, we can expect to see even more powerful embedding models and fine-tuning strategies emerge, by staying informed and adapting your approach, u can harness the full potential of embedding models for building high-quality nlp applications tailored to your specific needs. Happy Tuning . My LinkedIn : https://www.linkedin.com/in/ayoub-kirouane3 My HuggingFace : https://huggingface.co/ayoubkirouane SIGN UP TO DISCOVER HUMAN STORIES THAT DEEPEN YOUR UNDERSTANDING OF THE WORLD. FREE Distraction-free reading. No ads. Organize your knowledge with lists and highlights. Tell your story. Find your audience. Sign up for free MEMBERSHIP Read member-only stories Support writers you read most Earn money for your writing Listen to audio narrations Read offline with the Medium app Try for 5 $/month Sentence Transformers Hugging Face Embedding Rage Large Language Models 556 556 2 Follow WRITTEN BY KIROUANE AYOUB 251 Followers ·Writer for GoPenAI I Like building Machine Learning models from scratch . Follow MORE FROM KIROUANE AYOUB AND GOPENAI kirouane Ayoub in GoPenAI BUILDING LLM AGENTS FROM SCRATCH (PART 1) : HOW TO BUILD A SIMPLE REASONING AND ACTING AGENT FROM… THIS BLOG POST KICKS OFF A SERIES DEDICATED TO BUILDING AI AGENTS FROM THE GROUND UP. IN THIS FIRST INSTALLMENT, WE’LL EXPLORE A BASIC… Sep 3 73 1 Paras Madan in GoPenAI BUILDING A MULTI PDF RAG CHATBOT: LANGCHAIN, STREAMLIT WITH CODE TALKING TO BIG PDF’S IS COOL. YOU CAN CHAT WITH YOUR NOTES, BOOKS AND DOCUMENTS ETC. THIS BLOG POST WILL HELP YOU BUILD A MULTI RAG… Jun 6 801 6 Paras Madan in GoPenAI ADVANCED RAG FOR DATABASE WITHOUT EXPOSING DB DATA: TEXT TO SQL DATA PRIVACY IS A BIG CONCERN WHEN IMPLEMENTING RAG SOLUTIONS. COMPANIES ARE NOT WILLING TO EXPOSE THEIR SENSITIVE DATA TO THESE LLM’S AT… Jul 18 222 3 kirouane Ayoub in GoPenAI MICROSOFT GRAPHRAG AND OLLAMA: CODE YOUR WAY TO SMARTER QUESTION ANSWERING TRADITIONAL RAG METHODS, WHICH PRIMARILY RELY ON SEMANTIC SIMILARITY SEARCH, OFTEN FALL SHORT WHEN FACED WITH COMPLEX QUESTIONS THAT… Aug 29 217 See all from kirouane Ayoub See all from GoPenAI RECOMMENDED FROM MEDIUM Anoop Maurya in Python in Plain English WHY PYMUPDF4LLM IS THE BEST TOOL FOR EXTRACTING DATA FROM PDFS (EVEN IF YOU DIDN’T KNOW YOU NEEDED… STUCK BEHIND A PAYWALL? READ FOR FREE! Oct 18 1.1K 10 Sebastian Petrus TOP 10 RAG FRAMEWORKS GITHUB REPOS 2024 RETRIEVAL-AUGMENTED GENERATION (RAG) HAS EMERGED AS A POWERFUL TECHNIQUE FOR ENHANCING THE CAPABILITIES OF LARGE LANGUAGE MODELS. Sep 4 398 5 LISTS NATURAL LANGUAGE PROCESSING 1782 stories·1391 saves AI REGULATION 6 stories·598 saves CHATGPT PROMPTS 50 stories·2163 saves PREDICTIVE MODELING W/ PYTHON 20 stories·1628 saves Lan Chu in Level Up Coding THE BEST RAG TECHNIQUE YET? ANTHROPIC’S CONTEXTUAL RETRIEVAL AND HYBRID SEARCH HOW COMBINING CONTEXTUAL BM25 WITH CONTEXTUAL EMBEDDINGS CAN MASSIVELY IMPROVE YOUR RAG SYSTEM. Oct 1 1.3K 11 Nir Bar in CyberArk Engineering BUILDING PRODUCTION-READY AI AGENTS WITH LANGGRAPH: A REAL-LIFE USE CASE A DEVELOPER GUIDE ON BUILDING PRODUCTION-READY AI AGENTS WITH LANGGRAPH WITH A REAL-LIFE USE CASE Oct 14 1K 5 Tarun Singh in AI Advances USING PHI-3-VISION-128K FOR REAL-WORLD IMAGE DATA EXTRACTION: FROM INVOICES TO LANDMARKS INTRODUCTION Oct 11 385 1 Pranav Mehta in Generative AI STILL USING THE ‘YOU ARE AN EXPERT… ’ AI PROMPT 11 CREATIVE WAYS I USE AI THAT ARE HONESTLY USEFUL Oct 12 2.5K 38 See more recommendations Help Status About Careers Press Blog Privacy Terms Text to speech Teams To make Medium work, we log user data. By using Medium, you agree to our Privacy Policy, including cookie policy.