โ˜„๏ธLLM-based Scorers

In this approach, we use a "Judge LLM" to assess responses according to a rubric you provide. The method builds on the idea of using LLMs as evaluators: it automates the first layer of evaluation and so reduces the data scientist's manual workload. LLM-based evaluation is fast and cost-effective, although its reliability and accuracy can be a concern in some cases.
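To make the idea concrete, here is a minimal sketch of a judge loop in Python. It is not the Evaluable AI implementation: the prompt wording, the function names (build_judge_prompt, parse_grade, judge), and the "Grade: <letter>" reply format are all assumptions made for illustration, and call_llm stands in for whatever client you use to reach the judge model.

```python
import re
from typing import Callable, Optional, Set


def build_judge_prompt(rubric: str, question: str, response: str) -> str:
    """Assemble the text sent to the judge model (hypothetical wording)."""
    return (
        "You are an impartial judge. Grade the response using the rubric.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with a single line of the form 'Grade: <letter>'."
    )


def parse_grade(judge_output: str, allowed: Set[str]) -> Optional[str]:
    """Extract a single-letter grade; return None if absent or not allowed."""
    match = re.search(r"Grade:\s*([A-Za-z])\b", judge_output)
    if match is None:
        return None
    grade = match.group(1).upper()
    return grade if grade in allowed else None


def judge(
    call_llm: Callable[[str], str],  # assumed: takes a prompt, returns text
    rubric: str,
    question: str,
    response: str,
    allowed: Set[str],
) -> Optional[str]:
    """Ask the judge model for a grade and parse its reply."""
    prompt = build_judge_prompt(rubric, question, response)
    return parse_grade(call_llm(prompt), allowed)
```

Returning None for an unparseable reply, rather than guessing a grade, keeps unreliable judge outputs visible so they can be reviewed by hand.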

At Evaluable AI, we offer two ways of leveraging LLM-based scorers: template and custom LLM-based scorers.

Template LLM-based Scorers

We currently offer two template evaluators on our platform:

  • Sentiment: Evaluating sentiment analysis capabilities of LLMs involves a combination of quantitative metrics and qualitative analyses to determine how well the models can understand, interpret, and generate sentiment-related text.

  • Factuality: Evaluating the factuality of LLMs is essential to ensure they provide accurate and reliable information in their responses. This involves several specific methodologies and approaches to assess whether the content generated by LLMs adheres to factual correctness and doesn't perpetuate misinformation or generate erroneous content.

These scorers cannot be edited or deleted. However, they can be cloned and modified for your use case through the following steps:

  1. On the Custom Scorer page, find the scorer you wish to clone and click the three-dot button on its right. Click "Clone."

  2. A popup opens asking you to confirm the cloning. Click "Clone."

Evaluate Against Template Scorer

For a high-level review of how to evaluate and review an evaluation/score, refer to the demo on the Evaluation page.

Custom LLM-based Scorer

We also offer the ability to add your own custom scorer, which allows for a more tailored, relevant, and effective assessment process. Below are some benefits of creating your own custom LLM-based scorer:

  • Customization: Every application of an LLM may have unique requirements based on its use case. A custom scorer lets you reflect those requirements in the rubric.

  • Enhanced Accuracy: Generic scoring systems may not capture all the nuances of how an LLM performs in specialized or niche contexts. A custom scorer can incorporate domain-specific metrics that more accurately reflect the performance of the model in the intended environment.

  • Integration of Complex Metrics: Custom evaluators can include metrics that are not typically captured by standard evaluation. These could include measures of creativity, empathy, or the ability to handle ambiguous inputs, all important for fields like customer service, therapy, or creative writing.

  • Drive Targeted Improvements: Custom metrics can provide detailed insights into aspects of performance that need enhancement, facilitating more efficient and effective upgrades to the model.

Create a Custom LLM-based Scorer

  1. On the top right of the Custom Scorer page within the LLM tab, click on "Create LLM Scorer."

  2. A form will open, allowing the user to enter the following details:

    • Name

    • Description: Add a brief description to explain the purpose of this scorer.

    • Model: Select a judge model from the Mistral, GPT, and Gemini Pro variants in the drop-down menu.

    • Model Configuration: Optionally, enable logprobs and select an n-value.

    • Grading: User can add the grading criteria and associate a color for each grade. Example: A 90%, B 80%, C 70%, D 60%, E 50%

    • Prompt: Enter a name for the [Task] or delete the section.

  3. Click "Save" at the top right of the page. You will be redirected to the Custom Scorer page, where the newly created scorer carries the USER tag, signifying that it is not a SYSTEM-created scorer.
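The Grading and Model Configuration fields above can be pictured with a small Python sketch. It rests on two assumptions that the form itself does not confirm: that each grade's percentage is a minimum score for that grade, and that the n-value is the number of judge completions sampled per response. The names GRADE_THRESHOLDS, grade_for, and majority_grade are hypothetical and not part of the platform.

```python
from collections import Counter
from typing import Dict, List, Optional

# Thresholds taken from the example in the form: A 90%, B 80%, C 70%, D 60%, E 50%.
# Assumption: each value is the minimum percentage needed to earn that grade.
GRADE_THRESHOLDS: Dict[str, float] = {"A": 90, "B": 80, "C": 70, "D": 60, "E": 50}


def grade_for(
    score_percent: float,
    thresholds: Dict[str, float] = GRADE_THRESHOLDS,
) -> Optional[str]:
    """Return the highest grade whose threshold the score meets, else None."""
    ranked = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    for grade, minimum in ranked:
        if score_percent >= minimum:
            return grade
    return None


def majority_grade(samples: List[str]) -> Optional[str]:
    """Pick the most common grade among n judge samples.

    Ties go to the grade that appeared first in the list.
    """
    if not samples:
        return None
    return Counter(samples).most_common(1)[0][0]
```

For example, under these assumptions a score of 85 maps to grade B, and three judge samples of A, B, and B aggregate to B. Sampling several judgments and taking the majority is one common way to dampen the variability of a single judge response.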

Evaluate Against Custom Scorer

For a high-level review of how to evaluate and review an evaluation/score, refer to the demo on the Evaluation page.
