3️⃣ Assessing the Scores

Now we will walk through the Scores table and analyze our evaluation results.

  1. Below is a screenshot of all the runs that have evaluations. The runs outlined in red are the ones we focus on in this walkthrough.

  2. The evaluation scores appear under the Factuality column. Because we evaluated only against the Factuality scorer, the other scorer columns are blank for these runs. To adjust which columns the Scores table displays, click the "Show/Hide columns" button at the top right and toggle the options in the popup that opens.

  3. To view detailed information about an evaluation, either double-click the row you wish to review, or click the three-dot button on that row and select "Details" from the popup that opens.

  4. This opens a drawer on the right. The details shown include the Template name, Input, Output, Context (if provided during the run), Expected Output (if provided during the run), Tags, and the evaluation scorer's analysis.

  5. Let's unpack what the above screenshot's evaluation means. We used the prompt template "Who is the CEO of {{company}}?" and added a data point with Amazon as the company. For Amazon, we entered "Andy Jassy" as the expected output (ground truth). However, Mistral's Tiny model incorrectly identified the CEO as "Jeff Bezos." This response received a factuality score of 0.25, which corresponds to category D, indicating a discrepancy between the expert-provided answer and the model's response. To review the Factuality rubric, navigate to the Custom Scorer page and open the Factuality scorer under the LLM tab.
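To make the scoring concrete, here is a minimal sketch of how an evaluation row like the one above can be represented: a prompt template is filled with the data point's variables, the model's output is compared against the ground truth, and the grader's rubric category maps to a numeric score. Everything here is illustrative rather than this platform's actual API, and the category-to-score mapping is hypothetical except for D = 0.25, which comes from the walkthrough above.

```python
# Hypothetical sketch of a factuality-style evaluation row.
# Only D = 0.25 is taken from the walkthrough; the other
# category scores and all names are illustrative assumptions.
CATEGORY_SCORES = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "E": 0.0}

def render_prompt(template: str, **variables: str) -> str:
    """Fill {{placeholder}} slots in a prompt template."""
    prompt = template
    for name, value in variables.items():
        prompt = prompt.replace("{{" + name + "}}", value)
    return prompt

prompt = render_prompt("Who is the CEO of {{company}}?", company="Amazon")
expected = "Andy Jassy"      # ground truth supplied with the data point
model_output = "Jeff Bezos"  # what the model actually returned

category = "D"               # rubric category assigned by the grader
score = CATEGORY_SCORES[category]

print(prompt)  # Who is the CEO of Amazon?
print(score)   # 0.25
```

In this sketch the numeric score is derived purely from the rubric category, which is why reading the Factuality rubric on the Custom Scorer page is the key to interpreting a value like 0.25.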

Next, we will proceed to the Run Analytics and Evaluation Analytics pages to delve deeper into our findings.
