
PAPER REVIEW


PAPER REVIEW: LLAMA 2: OPEN FOUNDATION AND FINE-TUNED CHAT MODELS


LLAMA 2: ONE OF THE BEST OPEN SOURCE MODELS

Andrew Lukyanenko


Published in GoPenAI · 15 min read · Jul 19, 2023


Project link

Model link

Paper link



The authors present Llama 2, a collection of pretrained and fine-tuned large
language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
The fine-tuned versions, called Llama 2-Chat, are optimized for dialogue use
cases. These models outperform existing open-source chat models on most
benchmarks, and, based on human evaluations of helpfulness and safety, they may
be a suitable substitute for closed-source models. The authors also detail their
approach to fine-tuning and safety enhancements for Llama 2-Chat to support the
community in further developing and responsibly handling LLMs.


PRETRAINING



The authors developed the Llama 2 model family starting from the pretraining
methodology of Llama, which utilizes an optimized auto-regressive transformer.
They implemented several modifications for improved performance, including
enhanced data cleaning, updated data mixes, training on 40% more total tokens,
and doubling the context length. They also incorporated grouped-query attention
(GQA) to enhance the inference scalability for their larger models.
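
To make the grouped-query attention idea concrete, here is a minimal PyTorch sketch in which several query heads share a single key/value head. It is an illustrative approximation, not the actual Llama 2 implementation; the tensor layout, the `repeat_interleave` trick, and the inline causal mask are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """Minimal grouped-query attention: several query heads share one KV head.

    q: (batch, seq, n_heads * head_dim)
    k, v: (batch, seq, n_kv_heads * head_dim)
    """
    bsz, seqlen, _ = q.shape
    head_dim = q.shape[-1] // n_heads
    group = n_heads // n_kv_heads  # query heads served by each KV head

    q = q.view(bsz, seqlen, n_heads, head_dim).transpose(1, 2)      # (b, H, s, d)
    k = k.view(bsz, seqlen, n_kv_heads, head_dim).transpose(1, 2)   # (b, Hkv, s, d)
    v = v.view(bsz, seqlen, n_kv_heads, head_dim).transpose(1, 2)

    # Share each KV head across `group` query heads
    k = k.repeat_interleave(group, dim=1)                           # (b, H, s, d)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5            # (b, H, s, s)
    causal = torch.triu(
        torch.ones(seqlen, seqlen, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(causal, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                             # (b, H, s, d)
    return out.transpose(1, 2).reshape(bsz, seqlen, n_heads * head_dim)
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention; shrinking `n_kv_heads` shrinks the KV cache, which is what improves inference scalability for the larger models.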


PRETRAINING DATA

The authors trained the Llama 2 models on a new mix of data from publicly
available sources, excluding any data from Meta’s products or services. They
made efforts to remove data from certain sites known to contain large amounts of
personal information about private individuals. The models were trained on
2 trillion tokens of data, an amount the authors consider a good
performance-cost trade-off. They also up-sampled the most factual sources to
increase knowledge and reduce instances of false information generation, or
“hallucinations”.


LLAMA 2 PRETRAINED MODEL EVALUATION



Llama 2 models significantly outperform their Llama 1 counterparts:

 * The 70 billion-parameter Llama 2 model notably improves results on the MMLU
   and BBH benchmarks by roughly 5 and 8 points, respectively, when compared to
   the 65 billion-parameter Llama 1 model.
 * Llama 2 models with 7 billion and 30 billion parameters outperform MPT models
   of similar size in all categories except code benchmarks.
 * In comparison with Falcon models, Llama 2’s 7 billion and 34 billion
   parameter models outperform the 7 billion and 40 billion parameter Falcon
   models in all benchmark categories.
 * Moreover, the Llama 2 70B model surpasses all open-source models.

Comparatively, the Llama 2 70B model performs similarly to the closed-source
GPT-3.5 (OpenAI, 2023) on the MMLU and GSM8K benchmarks but shows a significant
deficit on coding benchmarks. It matches or exceeds the performance of PaLM (540
billion parameters) on nearly all benchmarks. However, there remains a
substantial performance gap between the Llama 2 70B model and both GPT-4 and
PaLM-2-L.


FINE-TUNING


SUPERVISED FINE-TUNING (SFT)



The authors initiated the Supervised Fine-Tuning (SFT) phase with publicly
available instruction tuning data, as was done for Llama. However, they observed that
many third-party SFT data sources lacked diversity and quality, particularly for
aligning Large Language Models (LLMs) towards dialogue-style instructions.
Therefore, they prioritized collecting several thousand high-quality SFT data
examples, and found that using fewer but better-quality examples led to notable
performance improvements.

The authors discovered that tens of thousands of SFT annotations were enough to
achieve high-quality results, and ceased after collecting 27,540 annotations.
They highlighted the significant impact of different annotation platforms and
vendors on model performance, emphasizing the need for data checks even when
sourcing annotations from vendors. A manual examination of a set of 180 examples
showed that model outputs were often competitive with those handwritten by human
annotators, suggesting the value in shifting more annotation efforts to
preference-based annotation for Reinforcement Learning from Human Feedback
(RLHF).

In fine-tuning, each sample consisted of a prompt and an answer, concatenated
together with a special token used to separate the segments. The authors used an
autoregressive objective and zeroed-out the loss on tokens from the user prompt,
meaning they only backpropagated on answer tokens.
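
A minimal sketch of this masking scheme is shown below, assuming token IDs are already available as Python lists; `IGNORE_INDEX`, the separator handling, and the helper names are illustrative rather than the authors' actual code.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def build_sft_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer with a separator token and mask the prompt.

    Only answer tokens keep real labels, so only they contribute to the loss."""
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return torch.tensor(input_ids), torch.tensor(labels)

def sft_loss(logits, labels):
    """Autoregressive next-token loss; prompt positions (IGNORE_INDEX) are ignored."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for tokens 1..T
        labels[:, 1:].reshape(-1),                    # shifted targets
        ignore_index=IGNORE_INDEX,
    )
```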


HUMAN PREFERENCE DATA COLLECTION



The authors collected human preference data for reward modeling using a binary
comparison protocol, chosen to maximize the diversity of the collected prompts.
Annotators were asked to write a prompt and then choose between two model
responses based on set criteria. The two responses were sampled from two
different model variants and with varying temperature hyperparameters.
Annotators also rated the degree to which they preferred the chosen response
over the alternative: significantly better, better, slightly better, or
negligibly better/unsure.
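
For illustration, a single annotation under this protocol might be stored as a record like the following; the field names and values are hypothetical, not the authors' actual schema.

```python
# Hypothetical shape of one preference annotation; field names are illustrative.
comparison = {
    "prompt": "Explain how vaccines work to a ten-year-old.",
    "response_a": "...",          # sampled from one model variant / temperature
    "response_b": "...",          # sampled from a different variant / temperature
    "chosen": "response_a",
    "degree": "slightly better",  # significantly / better / slightly / negligibly better or unsure
    "focus": "helpfulness",       # or "safety"
    "safety_label": "both_safe",  # chosen_safe_other_not / both_safe / both_unsafe
}
```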

The focus of these preference annotations was on “helpfulness” and “safety”. The
authors define “helpfulness” as how well Llama 2-Chat responses fulfil users’
requests, and “safety” as whether responses comply with their safety guidelines.
Separate guidelines were provided for each focus area.

During the safety stage, model responses were categorized into three groups: 1)
the preferred response is safe and the other is not, 2) both responses are safe,
and 3) both responses are unsafe. No examples where the chosen response was
unsafe and the other safe were included.

Human annotations were collected in weekly batches. As more preference data was
collected, their reward models improved, allowing the authors to train
progressively better versions of Llama 2-Chat. As improvements shifted the
model’s data distribution, the authors collected new preference data using the
latest Llama 2-Chat iterations to keep the reward model up-to-date and accurate.

The authors collected over 1 million binary comparisons, referred to as Meta
reward modeling data. Compared to existing open-source datasets, their
preference data has more conversation turns and is, on average, longer.


REWARD MODELING

The authors developed a reward model that inputs a model response and
corresponding prompt and outputs a score indicating the quality (e.g.,
helpfulness, safety) of the generated response. These scores can then be used as
rewards to optimize the Llama 2-Chat model for better alignment with human
preferences.

They trained two separate reward models: one optimized for helpfulness
(Helpfulness RM) and another for safety (Safety RM). Both were initialized from
pretrained chat model checkpoints, which ensures knowledge transfer and prevents
an information mismatch between the reward model and the chat model that could,
for example, result in favoring hallucinations.

To train the reward model, pairwise human preference data was converted into a
binary ranking label format. The model was trained to ensure that the chosen
response scored higher than its counterpart. To leverage the four-point
preference rating scale, they added a margin component to the loss to help the
model assign more distinct scores to responses that have more differences.
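
The margin-augmented binary ranking loss can be sketched as below; the function name and the example margin values are illustrative, but the loss follows the form described in the paper, L = -log(sigmoid(r_chosen - r_rejected - m)).

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen, score_rejected, margin):
    """Binary ranking loss with a margin: -log(sigmoid(r_chosen - r_rejected - m)).

    Larger margins for pairs rated 'significantly better' push the reward model
    to assign more separated scores to clearly different responses."""
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()

# toy usage with made-up scores and margins
chosen = torch.tensor([2.3, 0.8])
rejected = torch.tensor([1.1, 0.9])
margin = torch.tensor([1.0, 0.0])  # e.g. "significantly better" vs "negligibly better"
loss = reward_ranking_loss(chosen, rejected, margin)
```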

The authors combined their new data with existing open-source preference
datasets to form a larger training dataset.


REWARD MODEL RESULTS



The authors evaluated their reward models on a test set held out from each batch
of human preference annotation data. They compared their models with publicly
available alternatives, including SteamSHP-XL, the Open Assistant reward model,
and GPT-4. Their models performed the best, particularly on the corresponding
internal test sets.

The authors noted a tension between the goals of helpfulness and safety, and
suggested this might be why their separate models performed best on their own
domain. A single model that aims to perform well on both aspects would need to
differentiate between better responses and distinguish safe prompts from
adversarial ones.

When scoring by preference rating, accuracy was superior for “significantly
better” test sets, and it degraded as comparisons became more similar. The
authors pointed out that accuracy on more distinct responses is key to improving
Llama 2-Chat’s performance.

In terms of scaling trends, the authors found that larger models provided better
performance for similar volumes of data, and performance had not plateaued with
the current volume of annotated data. The authors concluded that improving the
reward model’s accuracy could directly improve Llama 2-Chat’s performance, as
the ranking task of the reward is unambiguous.


ITERATIVE FINE-TUNING



Two main algorithms were used for RLHF fine-tuning: Proximal Policy Optimization
(PPO), the standard in RLHF literature, and Rejection Sampling fine-tuning, in
which the best output among K sampled model responses is selected for a gradient
update. The two algorithms differ in breadth (Rejection Sampling explores K
samples per prompt, while PPO generates only one) and depth (in PPO, the sample
at each step is a function of the model policy updated at the previous step,
while in Rejection Sampling all outputs are drawn from the model's initial
policy; with iterative training, these differences become less pronounced).

Rejection Sampling fine-tuning was performed only with the largest 70B Llama
2-Chat, with smaller models fine-tuned on the sampled data from the larger
model. Over iterations, the authors adjusted their strategy to include
top-performing samples from all prior iterations, leading to significant
performance improvements.
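
A rough sketch of the rejection-sampling step, best-of-K selection under the reward model, is shown below; `policy.generate` and `reward_model.score` are placeholder APIs, not the authors' actual code.

```python
def rejection_sample(prompt, policy, reward_model, k=8, temperature=1.0):
    """Sample K responses for a prompt and keep the highest-reward one.

    `policy.generate` and `reward_model.score` are placeholder calls standing in
    for the chat model's sampling routine and the trained reward model."""
    candidates = [policy.generate(prompt, temperature=temperature) for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]  # used as a new fine-tuning target, like an SFT example
```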

The authors illustrate the benefit of Rejection Sampling in two ways. They show
the delta between the maximum and median curves can be interpreted as the
potential gain of fine-tuning on the best output. They also found that the
optimal temperature for generating diverse samples isn’t constant during
iterative model updates.

After RLHF (V4), the authors sequentially combined Rejection Sampling and PPO
fine-tuning. For PPO, they iteratively improved the policy by sampling prompts
and generations from the policy and used the PPO algorithm to achieve the
objective. They also added a penalty term for diverging from the original
policy, as it’s helpful for training stability and to reduce reward hacking.
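
The combined reward used in the PPO stage, the reward-model score minus a KL penalty against the original policy, can be sketched as follows; the sequence-level KL estimate and the `beta` value are illustrative assumptions.

```python
def ppo_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.01):
    """Final reward for a sampled generation during PPO fine-tuning.

    rm_score:         scalar score from the reward model for (prompt, generation)
    logprobs_policy:  per-token log-probs under the current (updated) policy
    logprobs_ref:     per-token log-probs under the original, pre-RLHF policy
    beta:             KL penalty coefficient (value here is purely illustrative)
    """
    kl = (logprobs_policy - logprobs_ref).sum()  # sequence-level KL estimate
    return rm_score - beta * kl                  # penalize drifting from the original policy
```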


SYSTEM MESSAGE FOR MULTI-TURN CONSISTENCY



The authors proposed Ghost Attention (GAtt), a technique designed to help the
model remember initial instructions throughout a dialogue. The method, which
builds on the concept of Context Distillation, takes an instruction that should
be respected throughout the conversation and appends it to all user messages of
a synthetic dialogue dataset. During training, the instruction is kept only in
the first turn, and the loss is set to zero for all tokens from earlier turns.
This strategy was applied to a range of synthetic constraints, including
hobbies, language, and public figures. GAtt helped maintain attention to the
initial instructions over a larger part of the dialogue.
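
A rough sketch of how such GAtt training data could be assembled is shown below; `sample_fn`, the string concatenation, and the masking helper are placeholders meant to illustrate the idea, not the authors' pipeline.

```python
IGNORE_INDEX = -100  # loss is skipped on these positions

def build_gatt_dialogue(instruction, user_turns, sample_fn):
    """Sample a synthetic dialogue with the instruction attached to every user turn,
    then keep the instruction only in the first turn for training.

    `sample_fn(history)` is a placeholder for the chat model's sampling call."""
    history, assistant_turns = [], []
    for user in user_turns:
        history.append(f"{instruction}\n{user}")  # instruction on every turn while sampling
        reply = sample_fn(history)
        assistant_turns.append(reply)
        history.append(reply)

    # Training view: the instruction survives only in the first user turn
    train_user_turns = [f"{instruction}\n{user_turns[0]}"] + list(user_turns[1:])
    return train_user_turns, assistant_turns

def mask_earlier_turns(labels, last_turn_start):
    """Zero out (ignore) the loss on all tokens that precede the final assistant turn."""
    return [IGNORE_INDEX] * last_turn_start + labels[last_turn_start:]
```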

GAtt managed to ensure consistency even over 20+ turns, until the maximum
context length was reached. While this initial implementation has been
beneficial, the authors believe there is potential for further enhancement and
iteration on this technique.


RLHF RESULTS

Model-Based Evaluation



Evaluating large language models (LLMs) like Llama 2-Chat is a complex problem.
While human evaluation is considered the gold standard, it is not always
scalable and may present complications. As a solution, the authors first used
reward models to measure improvement in iterations of their Reinforcement
Learning from Human Feedback (RLHF) model versions, and later confirmed these
findings with human evaluations.

To test the reward model’s reliability, the authors collected a test set of
prompts and had them judged by human annotators. The results indicated that the
reward models were generally well-aligned with human preferences, validating
their use as a point-wise metric.

However, to prevent possible divergence from human preferences, the authors also
utilized a more general reward model trained on diverse open-source Reward
Modeling datasets. They hypothesize that iterative model updates may help
maintain alignment with human preferences.

In a final check to ensure no regression between new and old models, both models
were used in the next annotation iteration for comparison.

The authors’ models were shown to outperform ChatGPT in both safety and
helpfulness after RLHF-V3. For fair comparison, final results were also assessed
using GPT-4. This resulted in Llama 2-Chat still showing a win-rate of over 60%
against ChatGPT, although the advantage was less pronounced.

Human Evaluation



Human evaluations are often considered the gold standard for evaluating dialogue
models, and the researchers used this method to assess the Llama 2-Chat models’
helpfulness and safety. The models were compared to open-source models such as
Falcon and MPT (from MosaicML), as well as to closed-source models such as
ChatGPT and PaLM, using over 4,000 single-turn and multi-turn prompts.

The results showed that Llama 2-Chat models significantly outperformed
open-source models on both single-turn and multi-turn prompts, with the Llama
2-Chat 34B model winning more than 75% of comparisons against comparably sized
models. The largest Llama 2-Chat model was also competitive with ChatGPT.

Three different annotators independently evaluated each model generation
comparison to ensure inter-rater reliability (IRR), which was measured using
Gwet’s AC1/2 statistic. Depending on the model comparison, the AC2 score varied
between 0.37 and 0.55.

However, the authors acknowledge that human evaluations have certain
limitations. For instance, while the 4,000 prompt set is large by research
standards, it doesn’t cover all possible real-world usage scenarios. The prompt
set lacked diversity and didn’t include any coding- or reasoning-related
prompts. Evaluations focused on the final generation of a multi-turn
conversation, not the entire conversation experience. Finally, the subjective
and noisy nature of human evaluations means results could vary with different
prompts or instructions.


SAFETY


SAFETY IN PRETRAINING



The authors discuss the pretraining data used for the Llama 2-Chat model and the
steps taken to pretrain it responsibly. They didn’t use any user data and
excluded certain sites that contain large amounts of personal information. They
also aimed to minimize their carbon footprint and avoided additional filtering
that could result in demographic erasure. However, the authors warn the model
should be deployed only after significant safety tuning.

The demographic representation in the training data was analyzed, revealing an
overrepresentation of “he” pronouns compared to “she” pronouns, which might lead
to more frequent usage of “he” in the model’s outputs. The top demographic
identity terms related to religion, gender, nationality, race and ethnicity, and
sexual orientation all showed a Western skew.

The authors found a small amount of toxicity in the pretraining data, which may
affect the output of the model. They also identified English as the dominant
language in the training data, suggesting the model might not work as
effectively with other languages.



The safety capabilities of Llama 2 were tested using three automatic benchmarks:
TruthfulQA for truthfulness, ToxiGen for toxicity, and BOLD for bias. Compared
to its predecessor, Llama 2 demonstrated increased truthfulness, and the 7B
model also showed decreased toxicity. However, the 13B and 70B versions of
Llama 2 exhibited increased toxicity, potentially due to larger pretraining data
or different dataset mixes. While the authors noted an increase in positive
sentiment for many demographic groups, they emphasized the need for additional
safety mitigations before deployment and more comprehensive studies of the
model’s real-world impact.


SAFETY FINE-TUNING



The authors describe the safety fine-tuning approach for Llama 2-Chat, outlining
its techniques, safety categories, annotation guidelines, and methods to
mitigate safety risks.

 * Supervised Safety Fine-Tuning: Here, the team starts with adversarial prompts
   and safe demonstrations, included in the general supervised fine-tuning
   process. This helps align the model with safety guidelines early on.
 * Safety RLHF (Reinforcement Learning from Human Feedback): This method
   integrates safety into the general RLHF pipeline, which involves training a
   safety-specific reward model and gathering more adversarial prompts for
   better fine-tuning.
 * Safety Context Distillation: In this step, the model is refined by generating
   safer responses and distilling the safety context into the model. A targeted
   approach decides, for each sample, whether context distillation should be
   applied.

Safety categories identified are illicit and criminal activities, hateful and
harmful activities, and unqualified advice. To cover different varieties of
prompts, they use risk categories and attack vectors, such as psychological
manipulation, logic manipulation, syntactic manipulation, semantic manipulation,
and others.

For fine-tuning, prompts and demonstrations of safe model responses are gathered
and used following the established guidelines. The model’s ability to write
nuanced responses improves through RLHF.

The research team also found that adding an additional stage of safety
mitigation does not negatively impact model performance on helpfulness. However,
with more safety data mixed in model tuning, the model does answer certain
questions in a more conservative manner, leading to an increase in the rate of
false refusals (where the model refuses to answer legitimate prompts due to
irrelevant safety concerns).

Lastly, context distillation is used to encourage the model to associate
adversarial prompts with safer responses, and this context distillation only
occurs on adversarial prompts to prevent the degradation of model performance.
The safety reward model decides whether to use safety context distillation or
not.
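
A minimal sketch of this gating logic is given below; `model.generate`, `safety_rm.score`, and the preprompt handling are placeholder APIs used only to illustrate the idea of keeping the distilled answer when the safety reward model prefers it.

```python
def gated_context_distillation(prompt, model, safety_rm, safety_preprompt):
    """Generate with and without a safety preprompt; keep the distilled answer only
    if the safety reward model scores it higher (placeholder APIs throughout)."""
    plain = model.generate(prompt)
    distilled = model.generate(f"{safety_preprompt}\n{prompt}")
    if safety_rm.score(prompt, distilled) > safety_rm.score(prompt, plain):
        return prompt, distilled  # train on the safer, distilled answer
    return prompt, plain          # otherwise keep the original to avoid degrading quality
```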


RED TEAMING

The researchers discuss the application of red teaming as a proactive method for
identifying potential risks and vulnerabilities in large language models
(LLMs). These efforts involved more than 350 professionals from diverse fields
such as cybersecurity, election fraud, legal, civil rights, software
engineering, machine learning, and creative writing. The red teaming exercises
focused on various risk categories, like criminal planning, human trafficking,
privacy violations, etc., as well as different attack vectors. Some findings
indicated that early models often failed to recognize and handle problematic
content appropriately, but iterative improvements helped mitigate these issues.

Post-exercise, the data collected was analyzed thoroughly, considering factors
like dialogue length, risk area distribution, and the degree of risk. This
information was used for model fine-tuning and safety training. The
effectiveness of these red teaming exercises was measured using a robustness
factor, defined as the average number of prompts that would trigger a violating
response from the model per person per hour. For instance, on a 7B model, the
robustness improved significantly over several red teaming iterations and model
refinements.
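
The robustness factor as described reduces to a simple ratio; the sketch below uses made-up numbers purely for illustration.

```python
def robustness_gamma(violating_prompts, annotators, hours):
    """Robustness factor: average number of prompts that trigger a violating
    response, per red-teamer per hour (lower is better)."""
    return violating_prompts / (annotators * hours)

# made-up numbers, for illustration only
gamma = robustness_gamma(violating_prompts=36, annotators=10, hours=2.0)  # -> 1.8
```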

The red teaming efforts continue to be a valuable tool for improving model
safety and robustness, with new candidate releases consistently reducing the
rate of prompts that trigger violating responses. On average, each new model
rejected about 90% of the prompts that had triggered violating responses in the
previous model.


SAFETY EVALUATION OF LLAMA 2-CHAT



The authors used human evaluation to assess the safety of large language models
(LLMs), collecting around 2,000 adversarial prompts. The responses to these
prompts were assessed by raters on a five-point Likert scale, with 5 being the
safest and most helpful and 1 indicating severe safety violations. A rating of
1 or 2 was considered a violation.

The violation percentage served as the primary evaluation metric, with mean
rating as supplementary. Three annotators assessed each example, with a majority
vote determining if a response was violating safety guidelines. Inter-rater
reliability (IRR), measured using Gwet’s AC1/2 statistic, indicated a high
degree of agreement among annotators. The IRR scores varied depending on the
model being evaluated.
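
The violation metric can be sketched as follows, assuming each example carries three Likert ratings; the helper name and the toy ratings are illustrative.

```python
def violation_rate(ratings_per_example):
    """ratings_per_example: per-example triples of 1-5 Likert ratings.
    A rating of 1 or 2 is a violation; an example counts if the majority agree."""
    def violates(ratings):
        return sum(r <= 2 for r in ratings) > len(ratings) / 2
    flags = [violates(r) for r in ratings_per_example]
    return 100.0 * sum(flags) / len(flags)

# toy usage: two of the three examples are flagged as violations
print(violation_rate([(1, 2, 4), (5, 4, 4), (2, 2, 3)]))  # ~66.7
```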

The overall violation percentage and safety rating of various LLMs showed that
Llama 2-Chat performed comparably or better than others. It is important to note
that the evaluations are influenced by factors such as prompt set limitations,
review guidelines’ subjectivity, content standards, and individual raters’
subjectivity.

There was a trend observed that multi-turn conversations were more likely to
induce unsafe responses across all models. However, Llama 2-Chat still performed
well, particularly in multi-turn conversations.

In terms of truthfulness, toxicity, and bias, fine-tuned Llama 2-Chat showed
great improvements over the pre-trained model. It showed the lowest level of
toxicity among all compared models. Moreover, Llama 2-Chat showed increased
positive sentiment for many demographic groups after fine-tuning. In-depth
analyses and results of truthfulness and bias were provided in the appendix.


DISCUSSIONS


LEARNINGS AND OBSERVATIONS


 * The findings suggest that reinforcement learning was particularly effective
   in the tuning process due to its cost and time efficiency. The success of
   RLHF (Reinforcement Learning from Human Feedback) hinges on the synergistic
   relationship it creates between humans and LLMs during the annotation
   process. Notably, RLHF helps overcome the limitations of supervised
   fine-tuning and can lead to superior writing abilities in LLMs.
 * An interesting phenomenon related to RLHF was observed — dynamic re-scaling
   of temperature contingent upon the context. For creative prompts, increased
   temperature continues to generate diversity across RLHF iterations. However,
   for factual prompts, despite the rising temperature, the model learns to
   provide consistent responses.
 * The Llama 2-Chat model also demonstrated robust temporal organization
   abilities, which suggests LLMs might have a more advanced concept of time
   than previously thought.
 * An intriguing finding is that tool use emerged spontaneously in LLMs. Even
   though tool use was never explicitly annotated, the model demonstrated the
   capability to use a sequence of tools in a zero-shot context. While
   promising, LLM tool use can also pose safety concerns and requires further
   research and testing.




