November 19, 2024 | Launch Week 2 🚀

LLM-AS-A-JUDGE EVALUATORS FOR DATASET EXPERIMENTS

Marlies Mayerhofer

Introducing support for managed LLM-as-a-judge evaluators for dataset experiments.

INTRODUCTION

Building reliable AI applications is challenging because it’s hard to understand how changes impact performance. Without proper evaluation, teams end up playing whack-a-mole with bugs and regressions. Datasets and experiments in Langfuse help transform this uncertainty into a structured engineering process.

Benefits of investing in datasets and development evaluations:

* Measure the impact of changes before deployment
* Identify regressions early
* Compare specific dataset items across different runs using reliable scores
* Build stronger conviction in your test datasets by identifying gaps between test and production evaluations
* Create reliable feedback loops for development

Until now, datasets and experiments depended on custom evaluations that were added to the run via the SDKs/API. This is great if you need full flexibility or want to use your preferred evaluation library or scoring logic. There were LLM-as-a-judge evaluators, but they were limited to production runs and could not access the ground truth of your dataset (expected_output), which is necessary for reliable offline evaluation.

WHAT’S NEW?

Day 2 of Launch Week 2 brings managed LLM-as-a-judge evaluators to dataset experiments. Assign evaluators to your datasets and they will automatically run on new experiment runs, scoring your outputs based on your evaluation criteria.

You can run any LLM-as-a-judge prompt; Langfuse comes with templates for the following evaluation criteria: Hallucination, Helpfulness, Relevance, Toxicity, Correctness, Contextrelevance, Contextcorrectness, Conciseness.

Langfuse LLM-as-a-judge works with any LLM that supports tool/function calling and is accessible via one of the following APIs: OpenAI, Azure OpenAI, Anthropic, AWS Bedrock. Via LLM gateways such as LiteLLM, virtually any popular LLM can be used through the OpenAI connector.

HOW IT WORKS

SET UP YOUR LLM-AS-A-JUDGE EVALUATOR

Evaluators in Langfuse consist of:

* Dataset: Select which test examples (production cases, synthetic data, or manual tests) your evaluator should run on
* Prompt: The prompt you want to use for evaluation, including the mapping of variables from your dataset items to prompt variables
* Scoring: A custom score name and comment format you’d like the LLM evaluator to produce
* Metadata: A sampling rate to control costs, and a delay applied after your experiment runs before the evaluator executes

Learn more about LLM-as-a-judge evaluators in our evaluation documentation.
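For illustration, a correctness-style evaluation prompt might look like the sketch below. This is a hypothetical prompt, not one of the built-in templates, and the {{input}}, {{generation}}, and {{expected_output}} placeholders simply stand for whatever variables you map from your dataset items and experiment outputs in the evaluator configuration.

```
You are an impartial grader. Compare the submitted answer to the reference
answer and judge whether it is factually correct and complete.

Question:
{{input}}

Submitted answer:
{{generation}}

Reference answer (expected_output from the dataset item):
{{expected_output}}

Return a score between 0 and 1 (1 = fully correct) and a short comment
explaining your reasoning.
```

Because this evaluator runs on dataset experiments, it can use the dataset item’s expected_output as ground truth, which the previous production-only evaluators could not access.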
RUN EXPERIMENTS

Iterate on your application (prompts, model configuration, retrieval/application logic, etc.) and run an experiment via the Langfuse SDKs. Learn more in our datasets & experiments docs or run this end-to-end example (Python Notebook). A minimal sketch of the experiment loop is also included at the end of this post.

ANALYZE RESULTS

After successfully running your experiments, analyze the results and scores produced by your evaluator in the Langfuse UI, using the dataset experiment run comparison view.

Use the Langfuse UI to:

1. Compare metrics across experiment runs
2. Drill down into specific examples
3. Identify patterns in successes/failures
4. Track performance over time
5. Identify when to add more test cases to your dataset, e.g. when scores on your test dataset are strong but production evaluations are weak

LEARN MORE

Check out our documentation for detailed guides on:

* LLM-as-a-judge evaluators: How to set up your evaluator for production or testing with the right dataset, prompt, scoring, and metadata
* Datasets & Experiments: How to create and manage your development datasets, run experiments, and analyze results

Last updated on November 20, 2024
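For orientation, here is a minimal sketch of the experiment loop described under “Run experiments” above, assuming the Langfuse Python SDK’s get_dataset / item.observe interface. The dataset name, run name, and answer_question function are hypothetical placeholders for your own setup; see the linked notebook for the full end-to-end example.

```python
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()


@observe()  # traces the application function so the execution can be linked to the dataset item
def answer_question(question: str) -> str:
    # Placeholder for your prompts, model configuration, and retrieval/application logic.
    return "..."


# Hypothetical dataset name; use the dataset your evaluator is assigned to.
dataset = langfuse.get_dataset("qa-test-set")

for item in dataset.items:
    # item.observe() creates a trace for this execution and links it to the
    # dataset item under the given experiment run name.
    with item.observe(run_name="prompt-v2-gpt-4o"):
        answer_question(item.input)

# Flush queued events before the script exits.
langfuse_context.flush()
langfuse.flush()
```

Once the run completes, any LLM-as-a-judge evaluator assigned to this dataset scores the new run automatically, and the scores appear in the dataset experiment run comparison view described above.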