
FINE-TUNING DIALOGPT-MEDIUM ON DAILY DIALOG DATASET: A STEP-BY-STEP GUIDE


ENHANCING CONVERSATIONAL AI: FINE-TUNING DIALOGPT-MEDIUM ON DAILY DIALOG DATASET

Ahmed Ismail Khalid · Published in GoPenAI · 6 min read · May 30, 2023



INTRODUCTION

In this article, we will explore how to fine-tune the DialoGPT-Medium model on
the Daily Dialog dataset. DialoGPT-Medium is a powerful language model capable
of generating human-like responses in a dialogue context. By fine-tuning it on a
dialogue dataset like Daily Dialog, we can improve its conversational abilities.
We will walk through the code step by step, enabling readers to reproduce the
results and gain a clear understanding of the process. So let’s dive in!

Note : For this project, a Kaggle Notebook is used. You can use Kaggle, Google
Colab, Paperspace, or even your local environment. All of these cloud services
offer free, limited GPU access; with Kaggle, however, you get two T4 GPU
instances for 30 hours per week. If you are using Kaggle for this project, you
will need to create an account at www.wandb.ai and use your API key to train
and/or evaluate the model. If you are using a local machine, this project
assumes you are running Windows without an Anaconda environment, so all the
packages/libraries are installed using pip for Windows.
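
If you do use Weights & Biases on Kaggle, the simplest approach is to set your
API key before training starts. The snippet below is a minimal sketch, not part
of the original walkthrough; the key value is a placeholder you would replace
with your own key from www.wandb.ai.

import os

# Placeholder key: replace with your own API key from www.wandb.ai
os.environ["WANDB_API_KEY"] = "your-wandb-api-key"

# Alternatively, log in explicitly with the wandb client
# import wandb
# wandb.login(key="your-wandb-api-key")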


STEP 1: DATASET LOADING AND PREPROCESSING

The first thing we need to do is make sure we have all the necessary libraries
installed. Depending on which platform/cloud service you use, the required
libraries might come pre-installed. However, if you are running the code on your
local machine, you will have to make sure that they are installed. In any case,
it is better to install them anyway, in case there are any differences in the
pre-installed versions of these libraries. We do this using the code below in a
notebook cell:

!pip install -U transformers
!pip install datasets
!pip install -U accelerate

Note : If you are using your local machine, you can install the libraries by
opening a command prompt and running the commands above, without the ! at the
start, as shown below:

pip install -U transformers
pip install datasets
pip install -U accelerate

Next, we need to import the installed libraries. We can do so using the code
below:

import numpy as np
import tempfile
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer

NumPy will be used to extract the predictions when we compare the results
against the ground truth. Tempfile will be used to create a temporary directory
for storing the training outputs, which are then discarded. This is done to
avoid running out of disk space when using Kaggle and probably isn't needed when
using Google Colab, Paperspace, or your local machine.

Once the libraries are set up, we load the Daily Dialog dataset using the
load_dataset function from the datasets library. This dataset contains dialogues
between speakers, and we preprocess it by concatenating all the utterances
within a dialogue into a single string. The data is loaded using the code below

# Load the DailyDialog dataset
dataset = load_dataset('daily_dialog')

# Concatenate all utterances within a dialogue and map to 'dialog' key
def concatenate_utterances(example):
    example['dialog'] = " ".join(example['dialog'])
    return example

# Apply the function to all examples in the dataset
dataset = dataset.map(concatenate_utterances)
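
As an optional sanity check (not part of the original code), you can print one
training example after the mapping to confirm that the list of utterances has
indeed been joined into a single string:

# Inspect the first training example after concatenation
print(dataset['train'][0]['dialog'][:200])
print(dataset['train'].num_rows, 'training dialogues')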


STEP 2: MODEL AND TOKENIZER INITIALIZATION

Now, we initialize the DialoGPT-Medium tokenizer and model. We use the
GPT2Tokenizer and GPT2LMHeadModel classes from the transformers library for this
purpose. The tokenizer is responsible for tokenizing the text, while the model
performs the language modeling task.

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('microsoft/DialoGPT-medium')
# DialoGPT/GPT-2 has no dedicated padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('microsoft/DialoGPT-medium')
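
To see what the tokenizer actually produces, here is a small illustrative
snippet (not from the original article) that tokenizes a short utterance and
decodes it back:

# Tokenize a sample utterance to inspect the integer token IDs the model consumes
sample = "Hello, how are you today?"
ids = tokenizer(sample)['input_ids']
print(ids)                    # list of integer token IDs
print(tokenizer.decode(ids))  # round-trips back to the original text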


STEP 3: DATASET ENCODING

To prepare the dataset for training, we need to encode the dialogues. We define
an encode function that tokenizes the dialogues using the tokenizer. We set
appropriate parameters like truncation, padding, and maximum length to handle
varying dialogue lengths. Additionally, we create a ‘labels’ key in the encoded
data, which contains the same tokenized input as the ‘input_ids’. This will be
used for computing the language modeling loss during training. We then apply the
encode function to the dataset using the map method, which tokenizes and encodes
the dialogues in batches. By passing batched=True, the mapping function
processes the dataset in batches for efficiency. The resulting encoded dataset
contains the tokenized input dialogues and the corresponding labels.

# Encode the dataset
def encode(examples):
    encoded = tokenizer(examples['dialog'], truncation=True, padding='max_length', max_length=128)
    encoded['labels'] = encoded['input_ids'][:]
    return encoded

encoded_dataset = dataset.map(encode, batched=True)
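
As a brief optional check, assuming the mapping above completed successfully,
every encoded example should now carry fixed-length input_ids, attention_mask,
and labels fields of 128 tokens each:

# Verify the encoded fields and their fixed length
example = encoded_dataset['train'][0]
print(example.keys())
print(len(example['input_ids']), len(example['labels']))  # both should be 128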


STEP 4: TRAINING SETUP

In this step, we define the training arguments using the TrainingArguments class
from the transformers library. These arguments include the output directory for
saving the trained model, the number of training epochs, batch sizes for
training and evaluation, warmup steps for the learning rate scheduler, weight
decay for regularization, logging directory, and enabling mixed precision
training (fp16) for faster and memory-efficient training. Next, we create a
Trainer object with the initialized model, training arguments, and the encoded
train and validation datasets. The Trainer class provides a high-level API for
training and evaluating models.

# Define training arguments
training_args = TrainingArguments(
    output_dir=tempfile.mkdtemp(),   # output directory
    num_train_epochs=10,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=None,                # directory for storing logs
    fp16=True                        # use floating point 16 bit precision for training
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation']
)
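
Before launching training, it can be useful to gauge the model's size against
your GPU memory. This optional sketch simply counts the trainable parameters of
the loaded model:

# Count trainable parameters to gauge GPU memory requirements
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{num_params / 1e6:.1f}M trainable parameters')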


STEP 5: MODEL EVALUATION AND GENERATING PREDICTIONS (BEFORE FINE-TUNING)

Before fine-tuning the model, it's essential to evaluate its performance on the
validation dataset. We use the evaluate method of the Trainer to accomplish
this. The evaluation results, including the evaluation loss, are stored in the
pre_eval_results variable. We select only 10 samples for prediction to save disk
space and prediction time.

To analyze the model’s initial performance, we generate predictions for the
validation dataset before fine-tuning. Using the predict method of the Trainer,
we obtain the model’s predictions for a subset of the validation dataset. We
decode the predictions using the tokenizer and compare them with the ground
truth dialogues.

# Evaluate before fine-tuning
pre_eval_results = trainer.evaluate(encoded_dataset['validation'])

# Get predictions for validation set before fine tuning for 10 samples
pre_val_predictions = trainer.predict(encoded_dataset['validation'].select(range(10)))
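
Since the Trainer reports the mean cross-entropy loss, you can optionally
convert it to perplexity, a more familiar metric for language models. This is a
standard transformation, not something the original code does:

import math

# Perplexity is the exponential of the mean cross-entropy loss
pre_perplexity = math.exp(pre_eval_results['eval_loss'])
print(f'Perplexity before fine-tuning: {pre_perplexity:.2f}')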


STEP 6: MODEL FINE-TUNING

Now it’s time to fine-tune the DialoGPT-Medium model on the Daily Dialog
dataset. We train the model using the train method of the Trainer class. The
model is trained for the specified number of epochs, utilizing the training
arguments we defined earlier. This is very simple: we call the train method of
the trainer object we created, and it is done in only one line.

# Fine-tune the model
trainer.train()
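
Keep in mind that the output directory is a temporary folder, so the checkpoints
written during training will not survive the session. If you want to keep the
fine-tuned weights, a minimal sketch is shown below; the target path is just an
example, not from the article:

# Persist the fine-tuned model and tokenizer outside the temporary directory
save_path = './dialogpt-medium-dailydialog'   # example path, adjust as needed
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)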


STEP 7: MODEL EVALUATION AND GENERATING PREDICTIONS (AFTER FINE-TUNING)

After fine-tuning the model, we evaluate its performance on the validation
dataset once again. This step allows us to assess the improvement achieved
through fine-tuning. We use the evaluate method of the Trainer and store the
evaluation results in the post_eval_results variable.

To observe the impact of fine-tuning on the model’s responses, we generate
predictions for the validation dataset after fine-tuning. Similar to the
previous step, we use the predict method of the Trainer, decode the predictions,
and compare them with the ground truth dialogues. Since fine-tuning is now
complete, we also compare the results before and after tuning, combining the
predictions with the Python built-in function zip so we can iterate over the
pre-tuning and post-tuning predictions alongside the ground truth.

# Evaluate after fine-tuning
post_eval_results = trainer.evaluate(encoded_dataset['validation'])

# Print the evaluation losses before and after fine-tuning
print('Evaluation Results before fine-tuning :', pre_eval_results['eval_loss'])
print('Evaluation Results after fine-tuning  :', post_eval_results['eval_loss'])

# Get predictions for validation set after fine-tuning for the same 10 samples
post_val_predictions = trainer.predict(encoded_dataset['validation'].select(range(10)))

# Zip the pre and post tuning predictions
predictions = zip(pre_val_predictions.predictions, post_val_predictions.predictions)


STEP 8: RESULTS AND CONCLUSION

In the final step, we present the model’s predictions before and after
fine-tuning, providing a clear understanding of the improvements achieved
through the fine-tuning process.

for idx, (pre, post) in enumerate(predictions):
    # Convert each prediction's logits to token IDs with argmax, then decode to text
    pre_pred = tokenizer.decode(np.argmax(pre, axis=-1), skip_special_tokens=True)
    post_pred = tokenizer.decode(np.argmax(post, axis=-1), skip_special_tokens=True)
    ground_truth = encoded_dataset['validation'][idx]["dialog"]
    
    print('Ground truth \n' + ground_truth + '\n')
    print('Pre-prediction \n' + pre_pred + '\n')
    print('Post-prediction \n'+ post_pred + '\n')
    print('----------------------------------------------------------------------------------------------------------------------\n')
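
If you want to chat with the fine-tuned model rather than inspect argmax
decodings, you can use the generate method. The snippet below is a minimal
sketch following the standard DialoGPT usage pattern (prompt followed by the EOS
token), not part of the original article:

# Generate a single response to a user prompt with the fine-tuned model
prompt = "What are you doing this weekend?"
input_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors='pt').to(model.device)
output_ids = model.generate(
    input_ids,
    max_length=100,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
)
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)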


CONCLUSION

In this article, we explored how to fine-tune the DialoGPT-Medium model on the
Daily Dialog dataset. We followed a step-by-step approach, from dataset loading
and preprocessing to model initialization, training, and evaluation. By
reproducing the code presented here, readers can gain a comprehensive
understanding of the process and apply it to their own dialogue generation
tasks. Fine-tuning models like DialoGPT-Medium on specific datasets opens up
exciting possibilities for enhancing conversational AI systems.



