---
title: How to Connect Hugging Face with Label Studio SDK
hide_sidebar: true
order: 1001
open_in_collab: true
tutorial: true
community_author: yyassi-heartex
ipynb_repo_path: tutorials/how-to-connect-Hugging-Face-with-Label-Studio-SDK/how_to_connect_Hugging_Face_with_Label_Studio_SDK.ipynb
repo_url: https://github.com/HumanSignal/awesome-label-studio-tutorials/tree/main/tutorials/how-to-connect-Hugging-Face-with-Label-Studio-SDK
report_bug_url: https://github.com/HumanSignal/awesome-label-studio-tutorials/issues/new
thumbnail: /images/tutorials/tutorials-hugging-face-ls-sdk.png
meta_title: How to Connect Hugging Face with Label Studio SDK
---
A Complete Guide to Connecting Hugging Face and Label Studio
This tutorial shows you how to create a seamless NLP workflow by integrating Hugging Face datasets and models with Label Studio for annotation and active learning.
We'll build a Named Entity Recognition (NER) annotation project using the WikiANN dataset and integrate pre-trained models for intelligent pre-labeling.
Before we dive into the code, let's understand the value of connecting Hugging Face with Label Studio.
This integration creates a powerful, automated ML workflow that transforms how you build and deploy NLP models.
Typical improvements across annotation projects:
| Metric | Without Integration | With HF + Label Studio | Improvement |
|---|---|---|---|
| Labeling speed | Baseline | 60-80% faster | ⚡ 2-5x speedup |
| Annotation accuracy | 90-95% | 98%+ | ✅ Fewer errors |
| Data preparation | Days of manual work | Minutes (automated) | ⏱️ Massive time savings |
| Model iteration | Static (train once) | Continuous improvement | 🔄 Active learning |
| Workflow complexity | Multiple disconnected tools | Single integrated pipeline | 🎯 Simplified workflow |
Project: Label 10,000 medical documents for entity extraction
| Step | Traditional Approach | HF + Label Studio | Time Saved |
|---|---|---|---|
| Data import | Manual download + formatting (8-10 hrs) | Automated import (5 min) | ~10 hrs |
| Initial labeling | Label all 10,000 docs (500-600 hrs @ 3 min/doc) | Label 500 docs to bootstrap (25 hrs) | — |
| Model training | Train once at end | Train on 500 examples (1 hr) | — |
| Remaining labels | — | Pre-annotate 9,500 docs (2 hrs) + review/correct (95-140 hrs @ 45 sec/doc) | ~360 hrs |
| Total time | ~530 hours | ~130 hours | 75% reduction |
| Outcome | Static model | Continuously improving model | ✅ |
Ready to build this workflow? Let's get started! 👇
# Install required packages
%pip install -q label-studio-sdk datasets transformers torch huggingface_hub accelerate
print("✅ All packages installed successfully!")
Set your environment variables before running:

```bash
export LABEL_STUDIO_URL="http://localhost:8080"  # or your Label Studio URL
export LABEL_STUDIO_API_KEY="your-api-key-here"
```
To get your API key: Go to Label Studio → Account & Settings → Personal Access Token
import os
from label_studio_sdk import Client
# Get credentials from environment variables
ls_api_key = os.environ.get('LABEL_STUDIO_API_KEY')
ls_url = os.environ.get('LABEL_STUDIO_URL', 'http://localhost:8080')
if not ls_api_key:
raise ValueError('❌ Please set LABEL_STUDIO_API_KEY environment variable.')
# Connect to Label Studio
try:
ls = Client(url=ls_url, api_key=ls_api_key)
connection_status = ls.check_connection()
print(f'✅ Connected to Label Studio at {ls_url}')
print(f' Connection status: {connection_status}')
except Exception as e:
raise ConnectionError(f'❌ Failed to connect to Label Studio: {str(e)}')
Set your Hugging Face token:

```bash
export HF_TOKEN="your-hf-token-here"
```
Get your token at: https://huggingface.co/settings/tokens
from huggingface_hub import login
# Get Hugging Face token (optional but recommended for accessing private models)
hf_token = os.environ.get('HF_TOKEN')
if hf_token:
try:
login(token=hf_token)
print('✅ Logged into Hugging Face Hub')
except Exception as e:
print(f'⚠️ Warning: HF login failed: {str(e)}')
print(' Continuing with public models only...')
else:
print('ℹ️ No HF_TOKEN provided. Using public models only.')
We'll create a Named Entity Recognition project with labels for Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC) entities.
# Define Label Studio labeling configuration for NER
ner_config = '''
<View>
<Text name="text" value="$text"/>
<Labels name="ner" toName="text">
<Label value="PER" background="#FF6B6B"/>
<Label value="ORG" background="#4ECDC4"/>
<Label value="LOC" background="#95E77D"/>
<Label value="MISC" background="#FFE66D"/>
</Labels>
</View>
'''
# Create project (or use existing one)
project_title = 'Hugging Face + Label Studio NER Tutorial'
project = ls.start_project(
title=project_title,
label_config=ner_config,
description='Tutorial: NER annotation with WikiANN dataset from HuggingFace'
)
print(f'✅ Project created successfully!')
print(f' Project ID: {project.id}')
print(f' Project URL: {ls_url}/projects/{project.id}')
We'll load the WikiANN dataset (a multilingual NER dataset) from Hugging Face. WikiANN provides high-quality NER annotations for English and 175+ other languages.
Why this matters: Direct dataset import eliminates manual data preparation, ensures consistency, and makes it easy to update your annotation project with new data.
from datasets import load_dataset
# Load dataset from Hugging Face
print('📦 Loading WikiANN dataset from Hugging Face...')
dataset = load_dataset('wikiann', 'en', split='train[:100]')
print(f' Loaded {len(dataset)} examples')
# Convert Hugging Face dataset format to Label Studio task format
tasks = []
for idx, row in enumerate(dataset):
# Join tokens into a single text string
text = ' '.join(row['tokens'])
# Create Label Studio task format
task = {
"data": {
"text": text
},
# Store original metadata for reference
"meta": {
"source": "wikiann",
"hf_index": idx
}
}
tasks.append(task)
# Import tasks into Label Studio using SDK
print(f'\n📤 Importing {len(tasks)} tasks into Label Studio...')
project.import_tasks(tasks)
# Verify import
imported_tasks = project.get_tasks()
print(f'✅ Successfully imported {len(imported_tasks)} tasks!')
print(f'\n📝 Sample task:')
print(f' Text: {imported_tasks[0]["data"]["text"][:100]}...')
⚠️ Action Required: Before continuing, go to Label Studio and label a few tasks (at least 10-20 for meaningful training).
Once you've labeled some data, continue to the next cell.
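If you want to check your labeling progress without leaving the notebook, a quick count via the SDK works too (a small sketch, reusing the `project` object created above):

```python
# Quick progress check: count how many tasks already have annotations.
# Reuses the `project` object created earlier in this notebook.
tasks = project.get_tasks()
labeled = [t for t in tasks if t.get('annotations')]
print(f'Labeled {len(labeled)} of {len(tasks)} tasks')
```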
We'll export the labeled data and convert it to Hugging Face format for model training.
Why this matters: This automated conversion saves hours of manual data preparation and ensures your annotations are correctly aligned with model tokenization.
from transformers import AutoTokenizer
from datasets import Dataset
# Export annotations from Label Studio using SDK
print('📥 Exporting annotations from Label Studio...')
ls_data = project.export_tasks(export_type='JSON') # Already returns a list
# Check how many tasks have annotations
labeled_tasks = [task for task in ls_data if task.get('annotations')]
print(f' Total tasks: {len(ls_data)}')
print(f' Labeled tasks: {len(labeled_tasks)}')
if len(labeled_tasks) < 5:
print('\n⚠️ Warning: Very few labeled tasks. Results may not be meaningful.')
print(' Consider labeling more data in Label Studio before training.')
# Initialize tokenizer (using BERT for NER)
print('\n🔤 Loading tokenizer...')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast=True)
# Define NER label schema (BIO format)
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
label_to_id = {label: idx for idx, label in enumerate(label_list)}
print(f' Label schema: {label_list}')
# Convert Label Studio annotations to HuggingFace format
print('\n🔄 Converting annotations to HuggingFace format...')
texts = [task['data']['text'] for task in labeled_tasks]  # only convert tasks that actually have annotations
tokenized = tokenizer(texts, return_offsets_mapping=True, truncation=True, padding=True)
all_labels = []
for i, task in enumerate(labeled_tasks):
offsets = tokenized['offset_mapping'][i]
# Extract entity spans from Label Studio annotations
spans = []
annotations = task.get('annotations', [])
if annotations:
# Use the first annotation (or implement logic for multiple annotations)
results = annotations[0].get('result', [])
for result in results:
if result.get('type') == 'labels':
value = result['value']
spans.append((
value['start'],
value['end'],
value['labels'][0]
))
# Align spans with tokenized output (handle tokenization offsets)
token_labels = []
for token_start, token_end in offsets:
# Special tokens (CLS, SEP, PAD) have start==end
if token_start == token_end:
token_labels.append(-100) # Ignore in loss calculation
continue
# Find if this token overlaps with any entity span
label = 'O' # Default: Outside any entity
for span_start, span_end, span_label in spans:
# Check if token overlaps with span
if token_end <= span_start or token_start >= span_end:
continue # No overlap
# Determine if this is the beginning of an entity or inside
if token_start == span_start:
label = f'B-{span_label}' # Beginning of entity
else:
label = f'I-{span_label}' # Inside entity
break
token_labels.append(label_to_id[label])
all_labels.append(token_labels)
# Create HuggingFace Dataset
hf_dataset = Dataset.from_dict({
"input_ids": tokenized['input_ids'],
"attention_mask": tokenized['attention_mask'],
"labels": all_labels
})
print(f'✅ Conversion complete!')
print(f' Dataset size: {len(hf_dataset)} examples')
print(f'\n📝 Sample (first example):')
print(f' Input IDs shape: {len(hf_dataset[0]["input_ids"])}')
print(f' Labels shape: {len(hf_dataset[0]["labels"])}')
print(f' Sample tokens: {tokenizer.convert_ids_to_tokens(hf_dataset[0]["input_ids"][:20])}')
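To sanity-check the alignment, you can decode one converted example back into (token, tag) pairs. A short sketch, reusing the `hf_dataset`, `tokenizer`, and `label_list` objects from the cell above:

```python
# Sanity check: print (token, BIO tag) pairs for the first converted example.
# A label id of -100 marks special/padding tokens ignored by the loss.
example = hf_dataset[0]
tokens = tokenizer.convert_ids_to_tokens(example['input_ids'])
for token, label_id in list(zip(tokens, example['labels']))[:20]:
    tag = 'IGNORED' if label_id == -100 else label_list[label_id]
    print(f'{token:15s} {tag}')
```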
Note: This step demonstrates training a custom model. You can skip it and jump ahead to the pre-annotation section to use a pre-trained model instead.
Why train: Fine-tuning on your labeled data creates domain-specific models that outperform general models on your specific use case.
from transformers import (
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
DataCollatorForTokenClassification
)
import numpy as np
# Only train if we have enough labeled data
if len(labeled_tasks) < 10:
print('⚠️ Skipping training: Need at least 10 labeled examples.')
print(' Label more data in Label Studio, then re-run this cell.')
else:
print('🚀 Starting model training...')
# Initialize model
num_labels = len(label_list)
model = AutoModelForTokenClassification.from_pretrained(
'bert-base-cased',
num_labels=num_labels,
id2label={i: label for i, label in enumerate(label_list)},
label2id=label_to_id
)
# Split into train/validation
splits = hf_dataset.train_test_split(test_size=0.15, seed=42)
train_dataset = splits['train']
eval_dataset = splits['test']
print(f' Train set: {len(train_dataset)} examples')
print(f' Eval set: {len(eval_dataset)} examples')
# Data collator handles padding
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Training arguments
training_args = TrainingArguments(
output_dir='./ner_model',
num_train_epochs=3,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
learning_rate=5e-5,
evaluation_strategy='epoch',
save_strategy='epoch',
save_total_limit=2,
logging_steps=10,
load_best_model_at_end=True,
push_to_hub=False, # Set to True to push to HuggingFace Hub
)
# Simple metrics function
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[label_list[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
        # Flatten predictions/labels and compute simple token-level accuracy
        flat_preds = [p for seq in true_predictions for p in seq]
        flat_labels = [l for seq in true_labels for l in seq]
        accuracy = sum(p == l for p, l in zip(flat_preds, flat_labels)) / len(flat_labels)
return {"accuracy": accuracy}
# Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
# Train!
print('\n📚 Training in progress...')
train_result = trainer.train()
print('\n✅ Training complete!')
print(f' Final loss: {train_result.training_loss:.4f}')
# Save the model
trainer.save_model('./ner_model_final')
print('💾 Model saved to ./ner_model_final')
Now we'll connect a Hugging Face model as an ML backend to provide pre-annotations in Label Studio. This dramatically speeds up the annotation process!
Benefits:
- Pre-annotations: Model generates initial predictions for annotators to review
- Active Learning: Focus annotation efforts on difficult/uncertain examples
- Continuous Improvement: Retrain model with new labels, deploy updated predictions
We'll use a pre-trained NER model from Hugging Face. You can also use the custom model you trained in the previous step.
from transformers import pipeline
import time
# Load a pre-trained Hugging Face NER model
print('🤗 Loading Hugging Face NER model...')
# Using a popular pre-trained NER model
ner_pipeline = pipeline(
"ner",
model="dslim/bert-base-NER",
aggregation_strategy="simple" # Combines B- and I- tags
)
print('✅ Model loaded successfully!')
# Map Hugging Face labels to our Label Studio labels
LABEL_MAPPING = {
'PER': 'PER',
'ORG': 'ORG',
'LOC': 'LOC',
'MISC': 'MISC'
}
def create_predictions_for_task(task):
"""Generate predictions for a single Label Studio task"""
text = task['data']['text']
# Get predictions from Hugging Face model
entities = ner_pipeline(text)
# Convert to Label Studio format
results = []
for entity in entities:
# Map label if needed
label = LABEL_MAPPING.get(entity['entity_group'], 'MISC')
result = {
"from_name": "ner",
"to_name": "text",
"type": "labels",
"value": {
"start": entity['start'],
"end": entity['end'],
"text": text[entity['start']:entity['end']],
"labels": [label]
},
"score": entity['score'] # Confidence score
}
results.append(result)
return results
# Test the prediction function with a sample
print('\n🧪 Testing prediction function...')
sample_text = "Apple Inc. was founded by Steve Jobs in California."
test_task = {"data": {"text": sample_text}}
predictions = create_predictions_for_task(test_task)
print(f' Input: "{sample_text}"')
print(f' Found {len(predictions)} entities:')
for pred in predictions:
entity_text = pred['value']['text']
entity_label = pred['value']['labels'][0]
entity_score = pred['score']
print(f' - "{entity_text}" → {entity_label} (confidence: {entity_score:.2f})')
Now let's use our Hugging Face model to generate predictions for unlabeled tasks in Label Studio!
# Get unlabeled tasks from Label Studio
print('📋 Fetching unlabeled tasks...')
all_tasks = project.get_tasks()
unlabeled_tasks = [task for task in all_tasks if not task.get('annotations')]
print(f' Total tasks: {len(all_tasks)}')
print(f' Unlabeled tasks: {len(unlabeled_tasks)}')
if len(unlabeled_tasks) == 0:
print('\n✅ All tasks are already labeled! No predictions needed.')
else:
# Generate predictions for unlabeled tasks
print(f'\n🔮 Generating predictions for {min(10, len(unlabeled_tasks))} tasks...')
prediction_count = 0
for task in unlabeled_tasks[:10]: # Start with first 10 for demo
try:
# Generate predictions using our HuggingFace model
results = create_predictions_for_task(task)
# Create prediction in Label Studio using SDK
project.create_prediction(
task_id=task['id'],
result=results,
model_version='huggingface-bert-base-NER'
)
prediction_count += 1
# Show progress
if prediction_count % 5 == 0:
print(f' Generated {prediction_count} predictions...')
except Exception as e:
print(f' ⚠️ Error on task {task["id"]}: {str(e)}')
continue
print(f'\n✅ Successfully created {prediction_count} pre-annotations!')
print(f'\n💡 Next steps:')
print(f' 1. Open Label Studio: {ls_url}/projects/{project.id}')
print(f' 2. Review and correct the pre-annotations')
print(f' 3. Submit your annotations')
print(f' 4. Export and retrain for continuous improvement!')
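To put the active-learning idea into practice, you can rank tasks by the model's confidence and review the least certain ones first. A minimal sketch, reusing `create_predictions_for_task()` and `unlabeled_tasks` from the cells above:

```python
# Simple uncertainty sampling: rank tasks by average prediction confidence
# so annotators can review the least certain ones first.
task_confidence = []
for task in unlabeled_tasks[:10]:
    results = create_predictions_for_task(task)
    # Treat tasks with no detected entities as "confident" for this demo
    avg_score = sum(r['score'] for r in results) / len(results) if results else 1.0
    task_confidence.append((task['id'], avg_score))

# Lowest-confidence tasks are the best candidates for manual review
for task_id, score in sorted(task_confidence, key=lambda item: item[1])[:5]:
    print(f'Task {task_id}: average confidence {score:.2f}')
```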
Congratulations! You've built a complete Hugging Face + Label Studio integration pipeline for Named Entity Recognition!
✅ HF → LS: Loaded WikiANN dataset from Hugging Face into Label Studio
✅ LS → HF: Exported labeled data and converted to Hugging Face format with token alignment
✅ HF → LS: Generated pre-annotations using Hugging Face NER models
✅ Trained: Fine-tuned a custom NER model on your labeled data
1. Import data from Hugging Face → Label Studio
2. Annotate tasks in Label Studio (with ML assistance)
3. Export annotations → Train/fine-tune Hugging Face model
4. Deploy updated model → Generate better predictions
5. Repeat for continuous improvement! 🔄
Want to adapt this workflow for other tasks? Check out:
- Text Classification: Adapt the export/prediction logic for sentiment analysis, topic classification, etc. (see the config sketch after this list)
- Question Answering: Modify for extractive or generative QA tasks
- Summarization: Apply to text summarization workflows
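For text classification, for example, adapting the project is mostly a matter of swapping the NER `<Labels>` tag for `<Choices>`. A minimal sketch of a sentiment-analysis config (the project title here is just an example):

```python
# Sketch: a classification-style labeling config (sentiment analysis).
# The rest of the import/export/prediction workflow stays the same.
sentiment_config = '''
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
'''

classification_project = ls.start_project(
    title='Sentiment Classification Example',
    label_config=sentiment_config,
)
```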
For production use, you'll want to deploy your Hugging Face model as a persistent ML backend server. Here's how:
# Install label-studio-ml
pip install label-studio-ml
# Create a new ML backend project
label-studio-ml init my_ner_backend --script label_studio_ml/examples/simple_text_classifier.py
# Edit my_ner_backend/model.py to use your Hugging Face model
# Then start the server:
label-studio-ml start my_ner_backend --port 9090
Then connect it in Label Studio:
1. Go to Project Settings → Machine Learning
2. Click "Add Model"
3. Enter URL: http://localhost:9090
4. Enable "Use for interactive preannotations"
The approach used in this tutorial (SDK's create_prediction()) is perfect for:
- Batch processing of large datasets
- One-time pre-annotation runs
- Jupyter notebooks and data science workflows
- Prototyping and experimentation
| Feature | SDK Predictions | ML Backend Server |
|---|---|---|
| Real-time predictions | ❌ | ✅ |
| Interactive labeling | ❌ | ✅ |
| Batch processing | ✅ | ✅ |
| Easy setup | ✅ | ⚠️ Moderate |
| Production ready | ⚠️ | ✅ |
Choose based on your use case!