Evaluating model responses is a central function of Evaluable AI's suite, which is designed to enhance model performance. We offer a variety of evaluation methods and techniques.

Types of Evaluation

There are three types of evaluation scorers, listed below. Click the individual links to learn about the benefits of each type and how to leverage them. In the UI, these scorers can be viewed and modified on the Custom Scorer page.

  1. Static
  2. LLM
  3. Human (Coming soon!)
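Conceptually, a scorer is just a function that maps a model response (and, for static evaluations, an expected outcome) to a score. The sketch below is a hypothetical illustration of a static exact-match scorer, not Evaluable AI's actual implementation; the function name and normalization rules are assumptions.

```python
def static_exact_match(response: str, expected: str) -> float:
    """Hypothetical static scorer: 1.0 if the response matches the
    expected outcome (Gold Standard) after light normalization, else 0.0."""
    def normalize(text: str) -> str:
        # Collapse whitespace and ignore case so trivial formatting
        # differences do not fail the comparison.
        return " ".join(text.split()).casefold()

    return 1.0 if normalize(response) == normalize(expected) else 0.0


# Usage: compare a run's response against its Gold Standard.
score = static_exact_match("Paris  is the capital.", "paris is the capital.")
```

Real static scorers may use fuzzier criteria (regex, numeric tolerance, JSON equality); exact match is simply the easiest to reason about.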

How to Evaluate

  1. From the Prompt Runs page, select one or more runs you wish to evaluate. You can do this by clicking the "Action" dropdown and selecting "Evaluate," or by using the "Evaluate" button in the top right corner for bulk actions.

  2. After clicking "Evaluate," a popup appears listing the available evaluation methods. If you have an expected outcome (Gold Standard), you can choose Static Evaluations. Other options, such as LLM evaluations, are available whether or not an expected outcome exists; these use a judge LLM to grade responses. The popup also indicates how many runs you have selected for evaluation.

For an in-depth walkthrough of starting an evaluation from the Prompt Runs page, see Initiating an Evaluation.

Viewing the Results

After the evaluation completes, the results are displayed in a table similar to the Prompt Runs table, with all scores listed for easy comparison. See Viewing and Evaluating the Run for an in-depth walkthrough.
