--- title: "How to Compare Two AI Models with Label Studio" hide_sidebar: true order: 1003 open_in_collab: true tutorial: true community_author: nass600 ipynb_repo_path: tutorials/how-to-compare-two-ai-models-with-label-studio/how_to_compare_two_ai_models_with_label_studio.ipynb repo_url: https://github.com/HumanSignal/awesome-label-studio-tutorials/tree/main/tutorials/how-to-compare-two-ai-models-with-label-studio report_bug_url: https://github.com/HumanSignal/awesome-label-studio-tutorials/issues/new thumbnail: /images/tutorials/tutorials-compare-ai-models.png meta_title: How to Compare Two AI Models with Label Studio meta_description: Learn how to compare and evaluate two AI models with the Label Studio SDK. --- ## Why this matters Teams compare models constantly, and “who wins?” changes by use case: - **Frontier model bake-offs**: early days, you’re comparing models from different labs on your data to pick a default. - **Cost vs quality**: compare a cheap vs expensive model variant from the same lab to see if quality loss is worth the savings. - **Production vs challenger**: test a fine-tuned model, new guardrails, or an improved RAG pipeline against today’s production model. Human evaluation tells you which model wins and crucially where it wins (by domain or question type), so you can choose the right model, route intelligently, or iterate your prompts/RAG. ## What you’ll build - A Label Studio project with a rubric: Winner (A/B/Tie) + Quality (1–5) + Notes - A small demo dataset you can swap for your own - A Colab that creates the project, imports tasks, fetches annotations, and analyzes results > Works with Label Studio OSS or Enterprise (we’ll call out optional Enterprise features). 
## Prerequisites

- A running Label Studio (OSS or Enterprise) you can reach from Colab or your local Python
- `LS_URL` (e.g., http://localhost:8080 or your team URL)
- `LS_API_KEY` (personal token)
- Python 3.10+ (Colab is fine)
- ~20–30 minutes to annotate 20 items (solo or with a colleague)

## 0. Setup

We are going to lean on the Label Studio SDK for the whole project, so you will need two pieces of information at hand:

1. The base URL (`LS_URL`) where your Label Studio instance is running.
2. An `LS_API_KEY` to authenticate your requests.

If Colab can't reach your localhost, you can:

1. Run this notebook locally with Jupyter
2. Tunnel the connection with a service like ngrok or cloudflared ([ngrok example](https://dashboard.ngrok.com/get-started/setup/macos))

You can create a valid `LS_API_KEY` in the Account & Settings menu of the Label Studio UI (top right corner, by clicking your user avatar).

```python
!pip -q install label-studio-sdk pandas numpy matplotlib scipy

import os
import json
import time
from dataclasses import dataclass
from typing import List, Dict, Any

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binomtest

from label_studio_sdk import LabelStudio

# For nicer dataframe display in Colab
pd.set_option('display.max_colwidth', 160)

# Configure the Label Studio connection
# Note: Colab cannot reach your local http://localhost:8080 unless you tunnel.
# If testing locally, run this notebook in Jupyter, or expose LS via a tunnel.
LS_URL = os.getenv("LS_URL", "http://localhost:8080")  # <-- change if needed
LS_API_KEY = os.getenv("LS_API_KEY", "YOUR_TOKEN")     # <-- set your token

ls = LabelStudio(base_url=LS_URL, api_key=LS_API_KEY)
user = ls.users.whoami()
print("Connected to Label Studio as:", user.username)
print("LS_URL:", LS_URL)
```

## 1. Create the Label Studio project

### What the project looks like

The project shows the question, the two answers (Model A and Model B), and a rubric: Winner (A/B/Tie), Overall quality (1–5), and optional Notes.

![Labelling](https://github.com/HumanSignal/awesome-label-studio-tutorials/blob/main/tutorials/how-to-compare-two-ai-models-with-label-studio/how-to-compare-two-ai-models-with-label-studio-files/figure_1.png?raw=true)

```python
LABEL_CONFIG = r"""