Evaluation Analytics
After running evaluations, you can visualize the scores as charts on the Evaluation Analytics page. Here, you can specify a time period, select specific scorers, and use the tags defined earlier to filter down to the dataset whose evaluations you want to see. These charts help you gauge how a model is performing on a specific dataset, with the filters at the top of the page letting you examine performance for designated tags. This page is the perfect complement to the Run Analytics page's focus on inference metrics.
Features
Aggregated Metrics Viewing: Users can view aggregated metrics across multiple evaluations, enabling a comprehensive understanding of overall performance.
Detailed Evaluation Analysis: The page allows users to dive deep into individual evaluations, identifying specific inferences that may be negatively impacting model performance.
Data Visualization: The page supports advanced data analysis through bar and scatter plots for the various evaluation methods. This visual approach facilitates a more intuitive understanding of the data.
Direct Inference Insights: Users can directly access details about low-performing inferences from the charts. This feature simplifies the process of pinpointing issues within the data.
Fine-Tuning Dataset Creation: One of the unique capabilities of the analysis page is the ability to create fine-tuning datasets based on evaluation scores. This streamlines the process of improving model performance by leveraging direct insights from the evaluations.
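The export itself happens in the UI, but conceptually the selection works like the minimal sketch below: evaluation records are filtered by score and written out in a prompt/completion format. The field names (`input`, `output`, `score`), the score threshold, and the choice to keep high-scoring examples are illustrative assumptions, not the platform's actual schema or behavior.

```python
import json

# Illustrative evaluation records; field names are assumptions, not the
# platform's actual schema.
eval_records = [
    {"input": "Summarize the ticket ...", "output": "The user reports ...", "score": 0.92},
    {"input": "Translate to French ...", "output": "Bonjour ...", "score": 0.41},
    {"input": "Classify the intent ...", "output": "billing", "score": 0.88},
]

SCORE_THRESHOLD = 0.8  # assumed cutoff for examples worth keeping

# Keep only high-scoring inferences as fine-tuning examples.
finetune_examples = [
    {"prompt": r["input"], "completion": r["output"]}
    for r in eval_records
    if r["score"] >= SCORE_THRESHOLD
]

# Write a JSONL file in the prompt/completion format commonly used for fine-tuning.
with open("finetune_dataset.jsonl", "w") as f:
    for example in finetune_examples:
        f.write(json.dumps(example) + "\n")
```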
Page Components
Filtering
Users can filter runs by entering a date range, model(s), and/or tags at the top of the page. After clicking "Filter", the following information will populate, provided there is data that matches the filter criteria (a short sketch after the list shows how these figures relate to one another):
Latency: The average latency for all the runs in the time period
Tokens: Users can view three variants of tokens
Prompt Tokens: The individual elements of input text that are processed by the model
Completion Tokens: The tokens that the model generates in response to a given prompt
Tokens: The total prompt and completion tokens used
Cost: The cost incurred for all inferences for the given time period
Queries: The total number of queries in the given time period
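To make the relationships between these numbers concrete, here is a minimal sketch of how the summary metrics could be computed from a list of run records. The record field names are illustrative assumptions, not the platform's schema; the platform computes these values for you.

```python
# Illustrative run records for a filtered time period; field names are assumed.
runs = [
    {"latency_ms": 820, "prompt_tokens": 310, "completion_tokens": 120, "cost_usd": 0.0043},
    {"latency_ms": 640, "prompt_tokens": 150, "completion_tokens": 200, "cost_usd": 0.0035},
    {"latency_ms": 910, "prompt_tokens": 480, "completion_tokens": 95, "cost_usd": 0.0058},
]

queries = len(runs)                                            # Queries
latency = sum(r["latency_ms"] for r in runs) / queries         # Latency (average)
prompt_tokens = sum(r["prompt_tokens"] for r in runs)          # Prompt Tokens
completion_tokens = sum(r["completion_tokens"] for r in runs)  # Completion Tokens
total_tokens = prompt_tokens + completion_tokens               # Tokens (prompt + completion)
cost = sum(r["cost_usd"] for r in runs)                        # Cost

print(f"Latency: {latency:.0f} ms | Tokens: {total_tokens} "
      f"({prompt_tokens} prompt + {completion_tokens} completion) | "
      f"Cost: ${cost:.4f} | Queries: {queries}")
```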
Charts
Below these summary metrics, bar charts and dodge plots are displayed for the scorers specified in the filter (if there is data for the given filter). Below is an example screenshot:
Leveraging Tags
Tags are a powerful tool for filtering and aggregating data on the Evaluation Analytics page. Here's how to use them effectively:
Filtering with Tags: Use the top filter bar to input all relevant tags for the data you wish to analyze. Tags allow for a focused view of aggregated stats across various metrics.
Tagging Strategy: We recommend tagging runs with an experiment-level tag. Then, use this single tag in the monitoring view to obtain both aggregated and granular views of all runs related to that experiment. This approach simplifies monitoring and analyzing specific experiments or models (see the sketch below).
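As a concrete illustration of this strategy, the sketch below tags every run with a single experiment-level tag and then uses that one tag both to aggregate results and to inspect individual runs. The tag naming convention (`experiment:<name>`) and the record fields are assumptions for illustration only, not the platform's required format.

```python
# Illustrative runs, each carrying a single experiment-level tag.
runs = [
    {"id": "run-001", "tags": ["experiment:summarizer-v2"], "score": 0.91},
    {"id": "run-002", "tags": ["experiment:summarizer-v2"], "score": 0.64},
    {"id": "run-003", "tags": ["experiment:summarizer-v1"], "score": 0.78},
]

EXPERIMENT_TAG = "experiment:summarizer-v2"  # the one tag entered in the filter bar

# Granular view: every run belonging to the experiment.
experiment_runs = [r for r in runs if EXPERIMENT_TAG in r["tags"]]

# Aggregated view: one summary number for the whole experiment.
avg_score = sum(r["score"] for r in experiment_runs) / len(experiment_runs)

print(f"{EXPERIMENT_TAG}: {len(experiment_runs)} runs, average score {avg_score:.2f}")
```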