FINE-TUNING DIALOGPT-MEDIUM ON DAILY DIALOG DATASET: A STEP-BY-STEP GUIDE

ENHANCING CONVERSATIONAL AI: FINE-TUNING DIALOGPT-MEDIUM ON DAILY DIALOG DATASET

Ahmed Ismail Khalid · Published in GoPenAI · 6 min read · May 30, 2023

INTRODUCTION

In this article, we will explore how to fine-tune the DialoGPT-Medium model on the Daily Dialog dataset. DialoGPT-Medium is a powerful language model capable of generating human-like responses in a dialogue context, and fine-tuning it on a dialogue dataset like Daily Dialog can further improve its conversational abilities. We will walk through the code step by step so that readers can reproduce the results and gain a clear understanding of the process. So let's dive in!

Note: For this project, a Kaggle Notebook is used. You can also use Google Colab, Paperspace, or your local environment. All of these cloud services offer free, limited GPU access; with Kaggle you get two T4 GPU instances for 30 hours per week. If you are using Kaggle, you will need to create an account at www.wandb.ai and use your API key to train and/or evaluate the model (a minimal login sketch is shown after the training setup in Step 4). If you are using your local machine, this guide assumes a Windows setup without an Anaconda environment, so all packages/libraries are installed using pip.

STEP 1: DATASET LOADING AND PREPROCESSING

The first thing we need to do is make sure all the necessary libraries are installed. Depending on which platform or cloud service you use, the required libraries might come pre-installed. However, if you are running the code on your local machine, you will have to install them yourself. In any case, it is better to install them explicitly, in case the pre-installed versions differ. We do this with the code below in a notebook cell:

!pip install -U transformers
!pip install datasets
!pip install -U accelerate

Note: On your local machine, you can install the libraries by opening a command prompt and running the same commands without the leading !, like below:

pip install -U transformers
pip install datasets
pip install -U accelerate

Next we need to import the installed libraries. We can do so using the code below:

import numpy as np
import tempfile
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer

NumPy will be used to turn the model's output logits into token predictions when we compare the results against the ground truth. Tempfile is used to create a temporary directory for storing the training outputs so they can be discarded afterwards; this avoids running out of disk space on Kaggle and probably isn't needed when using Google Colab, Paperspace, or your local machine.

Once the libraries are set up, we load the Daily Dialog dataset using the load_dataset function from the datasets library. This dataset contains dialogues between speakers, and we preprocess it by concatenating all the utterances within a dialogue into a single string.
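To make the preprocessing concrete, here is a toy illustration (the dialogue is made up, not an actual sample from the dataset): each raw example stores 'dialog' as a list of utterance strings, and joining them gives the single string we will train on.

# Toy illustration of the concatenation step on a made-up two-turn dialogue
toy_example = {'dialog': ["Hi, how are you?", "I'm fine, thanks. And you?"]}
toy_example['dialog'] = " ".join(toy_example['dialog'])
print(toy_example['dialog'])   # Hi, how are you? I'm fine, thanks. And you?

The real mapping function below applies exactly this join to every example in the dataset.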
The data is loaded using the code below:

# Load the DailyDialog dataset
dataset = load_dataset('daily_dialog')

# Concatenate all utterances within a dialogue and map to the 'dialog' key
def concatenate_utterances(example):
    example['dialog'] = " ".join(example['dialog'])
    return example

# Apply the function to all examples in the dataset
dataset = dataset.map(concatenate_utterances)

STEP 2: MODEL AND TOKENIZER INITIALIZATION

Now we initialize the DialoGPT-Medium tokenizer and model. We use the GPT2Tokenizer and GPT2LMHeadModel classes from the transformers library for this purpose. The tokenizer is responsible for tokenizing the text, while the model performs the language modeling task. Since the GPT-2 tokenizer has no padding token by default, we set the pad token to the end-of-sequence token.

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('microsoft/DialoGPT-medium')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('microsoft/DialoGPT-medium')

STEP 3: DATASET ENCODING

To prepare the dataset for training, we need to encode the dialogues. We define an encode function that tokenizes the dialogues using the tokenizer, with truncation, padding, and a maximum length of 128 tokens to handle varying dialogue lengths. Additionally, we create a 'labels' key in the encoded data that contains a copy of the 'input_ids'; this is what the model uses to compute the language modeling loss during training.

We then apply the encode function to the dataset using the map method. By passing batched=True, the mapping function tokenizes and encodes the dialogues in batches for efficiency. The resulting encoded dataset contains the tokenized input dialogues and the corresponding labels.

# Encode the dataset
def encode(examples):
    encoded = tokenizer(examples['dialog'], truncation=True, padding='max_length', max_length=128)
    encoded['labels'] = encoded['input_ids'][:]
    return encoded

encoded_dataset = dataset.map(encode, batched=True)

STEP 4: TRAINING SETUP

In this step, we define the training arguments using the TrainingArguments class from the transformers library. These include the output directory for saving the trained model, the number of training epochs, batch sizes for training and evaluation, warmup steps for the learning rate scheduler, weight decay for regularization, the logging directory, and mixed precision training (fp16) for faster and more memory-efficient training.

Next, we create a Trainer object with the initialized model, the training arguments, and the encoded train and validation datasets. The Trainer class provides a high-level API for training and evaluating models.

# Define training arguments
training_args = TrainingArguments(
    output_dir=tempfile.mkdtemp(),   # output directory
    num_train_epochs=10,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=None,                # directory for storing logs
    fp16=True                        # use 16-bit floating point precision for training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation']
)
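Note: the Trainer reports metrics to any logging integrations it finds installed, which on Kaggle typically means Weights & Biases, so the evaluate and train calls below may prompt you for an API key. Below is a minimal sketch of logging in programmatically; the key string is a placeholder for your own key from www.wandb.ai. Alternatively, you can pass report_to="none" in TrainingArguments if you prefer to disable external logging entirely.

import wandb

# Log in to Weights & Biases before evaluating/training on Kaggle.
# Replace the placeholder below with your own API key.
wandb.login(key="YOUR_WANDB_API_KEY")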
STEP 5: MODEL EVALUATION AND GENERATING PREDICTIONS (BEFORE FINE-TUNING)

Before fine-tuning the model, it's essential to evaluate its performance on the validation dataset. We use the evaluate method of the Trainer to accomplish this. The evaluation results, including the evaluation loss, are stored in the pre_eval_results variable.

To analyze the model's initial performance, we also generate predictions for the validation dataset before fine-tuning. Using the predict method of the Trainer, we obtain the model's predictions for a subset of the validation dataset; we only select 10 samples to save disk space and prediction time. Later, we will decode these predictions with the tokenizer and compare them with the ground truth dialogues.

# Evaluate before fine-tuning
pre_eval_results = trainer.evaluate(encoded_dataset['validation'])

# Get predictions on 10 validation samples before fine-tuning
pre_val_predictions = trainer.predict(encoded_dataset['validation'].select(range(10)))

STEP 6: MODEL FINE-TUNING

Now it's time to fine-tune the DialoGPT-Medium model on the Daily Dialog dataset. We train the model with the train method of the Trainer class for the number of epochs specified in the training arguments we defined earlier. This is very simple and takes only one line:

# Fine-tune the model
trainer.train()

STEP 7: MODEL EVALUATION AND GENERATING PREDICTIONS (AFTER FINE-TUNING)

After fine-tuning the model, we evaluate its performance on the validation dataset once again. This step allows us to assess the improvement achieved through fine-tuning. We use the evaluate method of the Trainer and store the evaluation results in the post_eval_results variable.

To observe the impact of fine-tuning on the model's responses, we also generate predictions for the same 10 validation samples after fine-tuning; the decoding and comparison with the ground truth dialogues happens in the next step. Since fine-tuning is done, we print the evaluation losses before and after tuning, and we combine the two sets of predictions with Python's built-in zip function so that we can iterate over the pre-tuning and post-tuning predictions against the ground truth.

# Evaluate after fine-tuning
post_eval_results = trainer.evaluate(encoded_dataset['validation'])

# Print the evaluation losses before and after fine-tuning
print('Evaluation Results before fine-tuning :', pre_eval_results['eval_loss'])
print('Evaluation Results after fine-tuning :', post_eval_results['eval_loss'])

# Get predictions on 10 validation samples after fine-tuning
post_val_predictions = trainer.predict(encoded_dataset['validation'].select(range(10)))

# Zip the pre- and post-tuning predictions
predictions = zip(pre_val_predictions.predictions, post_val_predictions.predictions)
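The original walkthrough does not save the fine-tuned weights, but if you want to keep them for later use, a minimal sketch looks like this; the output directory name is arbitrary, and the temporary checkpoint directory from Step 4 will be lost when the notebook session ends.

# Optionally persist the fine-tuned model and tokenizer for later use
save_dir = 'dialogpt-medium-dailydialog'
trainer.save_model(save_dir)          # saves the model weights and config
tokenizer.save_pretrained(save_dir)   # saves the tokenizer files alongside them

You can then reload both with GPT2LMHeadModel.from_pretrained(save_dir) and GPT2Tokenizer.from_pretrained(save_dir).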
STEP 8: RESULTS AND CONCLUSION

In the final step, we present the model's predictions before and after fine-tuning, providing a clear picture of the improvements achieved through the fine-tuning process. The predictions returned by predict are logits over the vocabulary, so we take the argmax along the last axis with NumPy to recover token ids, decode them with the tokenizer, and print each ground truth dialogue next to the corresponding pre- and post-tuning predictions.

for idx, (pre, post) in enumerate(predictions):
    pre_pred = tokenizer.decode(np.argmax(pre, axis=-1), skip_special_tokens=True)
    post_pred = tokenizer.decode(np.argmax(post, axis=-1), skip_special_tokens=True)
    ground_truth = encoded_dataset['validation'][idx]["dialog"]

    print('Ground truth \n' + ground_truth + '\n')
    print('Pre-prediction \n' + pre_pred + '\n')
    print('Post-prediction \n' + post_pred + '\n')
    print('-' * 120 + '\n')

CONCLUSION

In this article, we explored how to fine-tune the DialoGPT-Medium model on the Daily Dialog dataset. We followed a step-by-step approach, from dataset loading and preprocessing to model initialization, training, and evaluation. By reproducing the code presented here, readers can gain a comprehensive understanding of the process and apply it to their own dialogue generation tasks. Fine-tuning models like DialoGPT-Medium on specific datasets opens up exciting possibilities for enhancing conversational AI systems.
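As a closing note, you may also want to try the fine-tuned model interactively rather than only comparing teacher-forced predictions. The snippet below is a minimal sketch of generating a single reply, assuming the model and tokenizer from the steps above are still loaded; the prompt and the generation settings (sampling, top_p, max_new_tokens) are illustrative choices, not part of the original walkthrough.

# Generate one reply from the fine-tuned model (illustrative settings)
prompt = "Hi, how are you doing today?"
inputs = tokenizer(prompt + tokenizer.eos_token, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
)
# The output contains the prompt followed by the reply, so strip the prompt tokens
reply = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)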