Docs
Scores & Evaluation
Model-based Evaluation

Model-based Evaluations in Langfuse

Model-based evaluations (LLM-as-a-judge) are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evalutions, LLMs are used to score a specific session/trace/LLM-call in Langfuse on criteria such as correctness, toxicity, or hallucinations.

There are two ways to run model-based evaluations in Langfuse:

  1. Via the Langfuse UI (beta)
  2. Via external evaluation pipeline

Via Langfuse UI (beta)

Where is this feature available?
  • Hobby
  • Pro
  • Team
  • Self Hosted
Model-based Evaluations in Langfuse

Store an API key

To use evals, you have to bring your own LLM API keys. To do so, navigate to the settings page and insert your API key. We store them encrypted on our servers.

Create an eval template

First, we need to specify the evaluation template:

  • Select the model and its parameters.
  • Customise one of the Langfuse managed prompt templates or write your own.
  • We use function calling to extract the evaluation output. Specify the descriptions for the function parameters score and reasoning. This is how you can direct the LLM to score on a specific range and provide specific reasoning for the score.
Langfuse

Create an eval config

Second, we need to specify on which traces Langfuse should run the template we created above.

  • Select the evaluation template to run.
  • Specify the name of the scores which will be created as a result of the evaluation.
  • Filter which newly ingested traces should be evaluated. (Coming soon: select existing traces)
  • Specify how Langfuse should fill the variables in the template. Langfuse can extract data from trace, generations, spans, or event objects which belong to a trace. You can choose to take Input, Output or metadata form each of these objects. For generations, spans, or events, you also have to specify the name of the object. We will always take the latest object that matches the name.
  • Reduce the sampling to not run evals on each trace. This helps to save LLM API cost.
  • Add a delay to the evaluation execution. This is how you can ensure all data arrived at Langfuse servers before evaluation is exeucted.
Langfuse

See the progress

Once the configuration is saved, Langfuse will start running the evals on the traces that match the filter. You can see the progress on the config page or the log table.

Langfuse

See scores

Upon receiving new traces, navigate to the trace detail view to see the associated scores.

Langfuse

Via External Evaluation Pipeline

Where is this feature available?
  • Hobby
  • Pro
  • Team
  • Self Hosted

You can run your own model-based evals on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as scores back to the traces in Langfuse. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case.

The example notebook is a good template to get started with building your own evaluation pipeline.

Was this page useful?

Questions? We're here to help

Subscribe to updates