πŸ”’ Heuristic Scorers

Some benefits of leveraging a heuristic (algorithmic) scorer include:

  • Targeted Testing: Custom evaluators can be developed to probe particular weaknesses or failure modes of an LLM, such as handling of edge cases, resistance to adversarial inputs, or performance under data scarcity.

  • Performance Optimization: By identifying specific areas where the LLM underperforms, developers can target their efforts more effectively and optimize the model’s performance in the ways that are most beneficial for its application.

  • Custom Metrics: Organizations can define their own metrics based on what success looks like for their particular use case.

  • Tailored Assessment: Custom evaluations can be designed to reflect the real-world contexts in which the model will be deployed, ensuring that the model's performance is tested under conditions similar to those it will face during actual use.

Established Scorers

At Evaluable AI, we use the following tools for heuristic/algorithmic evaluations (illustrative sketches of each computation follow the list):

  • ROUGE: ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation, with N-gram length N) is a metric used to evaluate the quality of summaries by measuring the overlap of N-grams between the system-generated text and the reference texts. It focuses on the precision, recall, and F-measure of matching N-grams, providing insights into the content overlap and, indirectly, the relevance and completeness of the generated summaries. This scorer can be configured to assess various lengths of N-grams, offering flexibility in evaluating the granularity of the textual match.

  • BLEU: The BLEU (Bilingual Evaluation Understudy) scorer quantitatively evaluates the quality of machine-translated text against one or more reference translations. It employs a precision-based approach by comparing n-grams of the translated text to those in the reference texts, applying a brevity penalty to account for length differences. This method allows for a nuanced assessment of translation accuracy, measuring how closely a machine's translation mirrors human translations. A score closer to 1 indicates a higher resemblance to the reference texts, making BLEU a standard metric in computational linguistics and machine translation research.

  • METEOR: The METEOR (Metric for Evaluation of Translation with Explicit Ordering) scorer extends beyond basic lexical matching to consider synonyms, paraphrasing, and stemming, providing a more nuanced assessment of translation quality. It aligns words between the generated text and reference texts, calculates precision and recall, applies a fragmentation penalty when matched words are scattered across many short chunks, and derives an F-mean score adjusted by that penalty to produce the final METEOR score. This approach aims to closely mimic human judgment, offering a balance between precision and recall in translation evaluation.

  • WER: Word Error Rate (WER) is a metric for evaluating the accuracy of speech recognition or machine translation systems. It measures the minimum number of word-level changes (insertions, deletions, and substitutions) required to transform the system-generated text into the reference text, normalized by the total number of words in the reference text. WER provides a clear indication of the textual difference at the word level, offering insights into the system's performance in terms of understanding and generating correct words.

  • Is Valid JSON: This scorer validates the structure of a given field to determine whether it is correctly formatted JSON. It supports both JSON objects and arrays, ensuring comprehensive validation. The scorer attempts to parse the input string; if parsing succeeds without errors, the input is deemed valid JSON, otherwise it is marked invalid. This functionality is crucial for data validation processes where proper JSON formatting is required.

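The sketches below illustrate the core computations behind these scorers. They are simplified assumptions for intuition only, not the exact implementations used by Evaluable AI, and the helper names (ngrams, rougeN, brevityPenalty, and so on) are invented for illustration. First, the N-gram overlap that underlies ROUGE-N, alongside a BLEU-style brevity penalty:

const ngrams = (tokens, n) => {
    const grams = [];
    for (let i = 0; i + n <= tokens.length; i++) {
        grams.push(tokens.slice(i, i + n).join(' '));
    }
    return grams;
};

const rougeN = (candidate, reference, n = 1) => {
    const candGrams = ngrams(candidate.toLowerCase().split(/\s+/), n);
    const refGrams = ngrams(reference.toLowerCase().split(/\s+/), n);
    // Count reference N-grams so each one can be matched at most once (clipped matching).
    const refCounts = new Map();
    refGrams.forEach((g) => refCounts.set(g, (refCounts.get(g) || 0) + 1));
    let overlap = 0;
    candGrams.forEach((g) => {
        if ((refCounts.get(g) || 0) > 0) {
            overlap += 1;
            refCounts.set(g, refCounts.get(g) - 1);
        }
    });
    const precision = candGrams.length ? overlap / candGrams.length : 0;
    const recall = refGrams.length ? overlap / refGrams.length : 0;
    const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
    return { precision, recall, f1 };
};

// BLEU combines clipped N-gram precision with a brevity penalty that
// discourages candidates shorter than the reference.
const brevityPenalty = (candidateLength, referenceLength) =>
    candidateLength >= referenceLength ? 1 : Math.exp(1 - referenceLength / candidateLength);
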
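METEOR's final combination step is compact. The formula below follows the original METEOR formulation (a recall-weighted harmonic mean and a cubic fragmentation penalty); the exact weights used by any given implementation may differ:

// precision and recall are computed over aligned unigrams; matches is the number of
// aligned unigrams and chunks is the number of contiguous matched spans.
const meteorScore = (precision, recall, matches, chunks) => {
    if (matches === 0) return 0;
    const fMean = (10 * precision * recall) / (recall + 9 * precision); // recall-weighted harmonic mean
    const penalty = 0.5 * Math.pow(chunks / matches, 3);                // fragmentation penalty
    return fMean * (1 - penalty);
};
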
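WER reduces to a word-level edit distance normalized by the length of the reference. A minimal dynamic-programming sketch (again, an assumed implementation, not the platform's):

const wer = (hypothesis, reference) => {
    const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
    const ref = reference.trim().split(/\s+/).filter(Boolean);
    if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
    // dp[i][j] = minimum edits to turn the first i hypothesis words into the first j reference words.
    const dp = Array.from({ length: hyp.length + 1 }, () => new Array(ref.length + 1).fill(0));
    for (let i = 0; i <= hyp.length; i++) dp[i][0] = i;
    for (let j = 0; j <= ref.length; j++) dp[0][j] = j;
    for (let i = 1; i <= hyp.length; i++) {
        for (let j = 1; j <= ref.length; j++) {
            const substitutionCost = hyp[i - 1] === ref[j - 1] ? 0 : 1;
            dp[i][j] = Math.min(
                dp[i - 1][j] + 1,                    // deletion
                dp[i][j - 1] + 1,                    // insertion
                dp[i - 1][j - 1] + substitutionCost  // substitution (or match)
            );
        }
    }
    return dp[hyp.length][ref.length] / ref.length;  // normalized by reference length
};
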
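The JSON validity check amounts to attempting a parse and catching failures. A minimal sketch, restricted to objects and arrays as described above and returning a score on the 0 to 1 scale:

const isValidJson = (value) => {
    try {
        const parsed = JSON.parse(value);
        // Accept only JSON objects and arrays, not bare scalars.
        return typeof parsed === 'object' && parsed !== null ? 1 : 0;
    } catch (err) {
        return 0;
    }
};
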
Evaluate Against an Established Scorer

Evaluable AI provides common scorers to leverage for evaluations. For a high-level walkthrough of running an evaluation and reviewing the resulting scores, refer to the demo on the Evaluation page.

Custom Heuristic Scorer

The algorithmic function outputs a double value on a scale from 0 to 1. Users can categorize different ranges on this scale; for instance, assigning one color to values between 0.1 and 0.2, another color to values between 0.2 and 0.3, and so on. The function takes the following parameters:

  • {string} llmInput

  • {Object} llmContext

  • {string} llmOutput

  • {string} groundTruth

  • {Object} metaData

Metadata is a JSON object passed in by the user at inference time. Any property or nested property of this object can be used for validation within the custom heuristic scorer. Below is an example:

const main = (llmInput, llmContext, llmOutput, groundTruth, metaData) => {
    // Return 1 when the output contains both the first and last name from the metadata, otherwise 0.
    if (llmOutput.includes(metaData.firstName) && llmOutput.includes(metaData.lastName)) {
        return 1;
    }
    return 0;
};

export { main };

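Because the return value is a double between 0 and 1, a custom scorer can also award partial credit. The variant below is hypothetical: metaData.order.items is an invented nested property used only to illustrate accessing nested metadata, and the scorer returns the fraction of expected terms found in the output.

const main = (llmInput, llmContext, llmOutput, groundTruth, metaData) => {
    // metaData.order.items is a hypothetical nested property supplied at inference time.
    const expectedTerms = (metaData.order && metaData.order.items) || [];
    if (expectedTerms.length === 0) return 0;
    const mentioned = expectedTerms.filter((term) => llmOutput.includes(term)).length;
    return mentioned / expectedTerms.length; // fraction of expected terms mentioned in the output
};

export { main };
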
Create a Custom Heuristic Scorer

  1. On the top right of the Custom Scorer page within the Heuristics tab, click on "Create Heuristic Scorer."

  2. A form will open, allowing the user to enter the following details:

    • Name

    • Description: Add a brief description to explain the purpose of this scorer.

    • Grading: Add the grading criteria and associate a color with each range.

    • Code: Paste the scorer code inside the provided function. Changing the function's name or arguments will cause score generation to fail.

  3. Click "Save" at the top right of the page. This will redirect back to the Custom Scorer page and the newly created scorer will have the USER tag associated with it, signifying that it is not a SYSTEM created scorer.

Evaluate Against Custom Scorer

For a high-level walkthrough of running an evaluation and reviewing the resulting score, refer to the demo on the Evaluation page.
