
November 19, 2024 | Launch Week 2 🚀


LLM-AS-A-JUDGE EVALUATORS FOR DATASET EXPERIMENTS

Marlies Mayerhofer

Introducing support for managed LLM-as-a-judge evaluators for dataset
experiments.


INTRODUCTION

Building reliable AI applications is challenging because it’s hard to understand
how changes impact performance. Without proper evaluation, teams end up playing
whack-a-mole with bugs and regressions. Datasets and experiments in Langfuse
help transform this uncertainty into a structured engineering process.

Benefits of investing in datasets and development evaluations:
 * Measure the impact of changes before deployment
 * Identify regressions early
 * Compare specific dataset items across different runs using reliable scores
 * Build stronger conviction in your test datasets by identifying gaps between
   test and production evaluations
 * Create reliable feedback loops for development

Until now, datasets and experiments depended on custom evaluations that were
added to the run via the SDKs/API. This is great if you need full flexibility or
want to use your preferred evaluation library or scoring logic. LLM-as-a-judge
evaluators already existed, but they were limited to production runs and could
not access the ground truth of your dataset (expected_output), which is
necessary for reliable offline evaluation.
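For context, the SDK-based custom-scoring flow looks roughly like the sketch
below. This is a minimal sketch assuming the v2-style Python SDK; the dataset
name, run name, score name, and my_app function are placeholders for your own
setup, not part of the announcement.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST


def my_app(question: str) -> str:
    """Placeholder for your application logic (prompt + model call)."""
    return "42"


dataset = langfuse.get_dataset("qa-dataset")  # placeholder dataset name

for item in dataset.items:
    # observe() creates a trace and links it to the dataset run
    with item.observe(run_name="baseline-run") as trace_id:
        output = my_app(item.input)
        # custom, SDK-side score comparing the output to the item's ground truth
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )

langfuse.flush()  # ensure all events are sent before the script exits
```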


WHAT’S NEW?

Day 2 of Launch Week 2 brings managed LLM-as-a-judge evaluators to dataset
experiments. Assign evaluators to your datasets and they will automatically run
on new experiment runs, scoring your outputs based on your evaluation criteria.

You can run any LLM-as-a-judge prompt. Langfuse also ships with templates for
the following evaluation criteria: Hallucination, Helpfulness, Relevance,
Toxicity, Correctness, Context Relevance, Context Correctness, and Conciseness.

Langfuse LLM-as-a-judge works with any LLM that supports tool/function calling
and is accessible via one of the following APIs: OpenAI, Azure OpenAI,
Anthropic, or AWS Bedrock. Through LLM gateways such as LiteLLM, virtually any
popular LLM can be used via the OpenAI connector.


HOW IT WORKS


SET UP YOUR LLM-AS-A-JUDGE EVALUATOR

Evaluators in Langfuse consist of:

 * Dataset: Select which test examples (production cases, synthetic data, or
   manual tests) your evaluator should run on
 * Prompt: The prompt used for evaluation, including the mapping of variables
   from your dataset items to prompt variables (an example prompt is sketched
   after this list)
 * Scoring: A custom score name and comment format you’d like the LLM evaluator
   to produce
 * Metadata: A sampling rate to control costs, and a delay that determines how
   long to wait after an experiment run before the evaluation starts
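For illustration, a judge prompt for a correctness evaluator could look like the
sketch below. The template text and the {{input}}, {{expected_output}}, and
{{output}} variables are hypothetical; in the evaluator setup you map such
variables to fields of your dataset items and to the experiment output.

```python
# Hypothetical correctness judge prompt; the {{...}} variables are placeholders
# that get mapped in the evaluator setup ({{input}} -> item input,
# {{expected_output}} -> item ground truth, {{output}} -> experiment output).
CORRECTNESS_JUDGE_PROMPT = """\
You are an impartial evaluator.

Question: {{input}}
Ground-truth answer: {{expected_output}}
Model answer: {{output}}

Score how correct the model answer is with respect to the ground truth on a
scale from 0 (incorrect) to 1 (fully correct) and briefly justify the score.
"""
```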

Learn more about LLM-as-a-judge evaluators in our evaluation documentation.


RUN EXPERIMENTS

Iterate on your application (prompts, model configuration, retrieval/application
logic, etc.) and run an experiment via the Langfuse SDKs.

Learn more in our datasets & experiments docs or run this end-to-end example
(Python Notebook).
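With an evaluator assigned to the dataset, the experiment loop itself no longer
needs any manual scoring. A minimal sketch, using the same placeholder names as
above (dataset, run name, and my_app are assumptions, not fixed API values):

```python
from langfuse import Langfuse

langfuse = Langfuse()


def my_app(question: str) -> str:
    """Placeholder for the application version you are experimenting with."""
    return "42"


dataset = langfuse.get_dataset("qa-dataset")  # placeholder dataset name

for item in dataset.items:
    # Each item becomes a trace linked to the experiment run "prompt-v2" (placeholder name).
    with item.observe(run_name="prompt-v2") as trace_id:
        output = my_app(item.input)
        langfuse.trace(id=trace_id, output=output)  # record the output on the trace

langfuse.flush()
# The LLM-as-a-judge evaluator assigned to the dataset scores each run item
# automatically; no langfuse.score() call is required.
```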


ANALYZE RESULTS

After running your experiments, analyze the results and scores produced by your
evaluator in the Langfuse UI using the dataset experiment run comparison view.
Use it to:

 1. Compare metrics across experiment runs
 2. Drill down into specific examples
 3. Identify patterns in successes/failures
 4. Track performance over time
 5. Identify when to add more test cases to your dataset, e.g. when evaluation
    scores on your test dataset are strong but production evaluations are weak




LEARN MORE

Check out our documentation for detailed guides on:

 * LLM-as-a-judge evaluators: How to set up your evaluator for production or
   test with the right dataset, prompt, scoring, and metadata
 * Datasets & Experiments: How to create and manage your development datasets,
   run experiments, and analyze results

Last updated on November 20, 2024

