Cloud storage is used for importing tasks and exporting annotations in Label Studio. There are two basic types of cloud storage: Source (Import) storages, which load tasks into Label Studio, and Target (Export) storages, which receive annotations from it.
Label Studio also has Persistent Storage, where it keeps export files, user avatars, and UI uploads. Do not confuse Cloud Storages with Persistent Storage: they have completely different codebases and purposes. Cloud Storages are implemented in io_storages, while Persistent Storage uses django-storages and is configured through Django settings environment variables (see base.py).
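For illustration, a minimal sketch of how Persistent Storage might be wired up through django-storages in Django settings. The bucket name is a placeholder and the exact environment-variable mapping lives in base.py; this is not the actual Label Studio configuration:

```python
# Hypothetical settings sketch: Persistent Storage backed by GCS via
# django-storages. Label Studio derives these from environment variables
# in base.py; values here are placeholders.
DEFAULT_FILE_STORAGE = 'storages.backends.gcloud.GoogleCloudStorage'
GS_BUCKET_NAME = 'my-label-studio-bucket'  # placeholder bucket name
```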
Note: Dataset Storages are implemented in the enterprise codebase only. They are deprecated and no longer used.
This section uses GCS storage as an example, and the same logic can be applied to other storages.
This storage type is designed for importing tasks FROM cloud storage to Label Studio. This diagram is based on Google Cloud Storage (GCS), and other storages are implemented in the same way:
```mermaid
graph TD;
    Storage-->ImportStorage;
    ProjectStorageMixin-->GCSImportStorage;
    ImportStorage-->GCSImportStorageBase;
    GCSImportStorageBase-->GCSImportStorage;
    GCSImportStorageBase-->GCSDatasetStorage;
    GCSImportStorageLink-->ImportStorageLink

    subgraph Google Cloud Storage
    GCSImportStorage;
    GCSImportStorageBase;
    GCSDatasetStorage;
    end
```
Storage (label_studio/io_storages/base_models.py): Abstract base for all storages. Inherits status/progress from StorageInfo. Defines validate_connection() contract and common metadata fields.
ImportStorage (label_studio/io_storages/base_models.py): Abstract base for source storages. Defines core contracts used by sync and proxy:
- iter_objects(), iter_keys() to enumerate objects
- get_unified_metadata(obj) to normalize provider metadata
- get_data(key) to produce StorageObject(s) for task creation
- generate_http_url(url) to resolve a provider URL -> HTTP URL (presigned or direct)
- resolve_uri(...) and can_resolve_url(...) used by the Storage Proxy
- scan_and_create_links() to create ImportStorageLinks for tasks

ImportStorageLink (label_studio/io_storages/base_models.py): Link model created per-task for imported objects. Fields: task (1:1), key (external key), row_group/row_index (parquet/JSONL indices), object_exists, timestamps. Helpers: n_tasks_linked(key, storage) and create(task, key, storage, row_index=None, row_group=None).
ProjectStorageMixin (label_studio/io_storages/base_models.py): Adds project FK and permission checks. Used by project-scoped storages (e.g., GCSImportStorage).
GCSImportStorageBase (label_studio/io_storages/gcs/models.py): GCS-specific import base. Sets url_scheme='gs', implements listing (iter_objects/iter_keys), data loading (get_data), URL generation (generate_http_url), URL resolution checks, and metadata helpers. Reused by both project imports and enterprise datasets.
GCSImportStorage (label_studio/io_storages/gcs/models.py): Concrete project-scoped GCS import storage combining ProjectStorageMixin + GCSImportStorageBase.
GCSImportStorageLink (label_studio/io_storages/gcs/models.py): Provider-specific ImportStorageLink with storage FK to GCSImportStorage. Created during sync to associate a task with the original GCS object key.
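The import contract above can be illustrated with a pure-Python sketch. It uses a dict as a stand-in "bucket" and mirrors only the method names listed above; the class and its bodies are hypothetical, not the real Django models:

```python
# Pure-Python sketch of the ImportStorage contract (no Django, no real GCS).
# Method names mirror those documented above; bodies are illustrative only.
class DictImportStorage:
    """Pretend cloud storage backed by a dict of key -> task data."""

    url_scheme = 'gs'  # GCS import storages use the gs:// scheme

    def __init__(self, objects):
        self.objects = objects

    def iter_keys(self):
        # Enumerate object keys in the "bucket"
        return iter(self.objects)

    def get_data(self, key):
        # Load the task payload for one object
        return self.objects[key]

    def scan_and_create_links(self):
        # In Label Studio this creates one ImportStorageLink per task;
        # here we just return plain dicts standing in for link rows.
        return [{'task': self.get_data(k), 'key': k} for k in self.iter_keys()]
```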
This storage type is designed for exporting tasks or annotations FROM Label Studio to cloud storage.
```mermaid
graph TD;
    Storage-->ExportStorage;
    ProjectStorageMixin-->ExportStorage;
    ExportStorage-->GCSExportStorage;
    GCSStorageMixin-->GCSExportStorage;
    ExportStorageLink-->GCSExportStorageLink;
```
ExportStorage (label_studio/io_storages/base_models.py): Abstract base for target storages. Project-scoped; orchestrates export jobs and progress. Key methods:

- save_annotation(annotation): provider-specific write
- save_annotations(queryset), save_all_annotations(), save_only_new_annotations(): helpers
- sync(save_only_new_annotations=False): background export via RQ

GCSExportStorage (label_studio/io_storages/gcs/models.py): Concrete target storage for GCS. Serializes data via _get_serialized_data(...), computes the key via GCSExportStorageLink.get_key(...), and uploads to GCS; can auto-export on annotation save when configured.
ExportStorageLink (label_studio/io_storages/base_models.py): Base link model connecting exported objects to Annotations. Provides get_key(annotation) logic (task-based or annotation-based via FF) and create(...) helper.
GCSExportStorageLink (label_studio/io_storages/gcs/models.py): Provider-specific link model holding FK to GCSExportStorage.
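The save_annotations() path described above fans writes out in parallel. A minimal sketch of that pattern, using the default worker count noted later in this section (min(8, cpu_count * 4)); the save_one callable is a stand-in for the provider-specific save_annotation write:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def save_annotations(annotations, save_one, max_workers=None):
    """Sketch of parallel export: apply save_one to every annotation.

    save_one stands in for the provider-specific save_annotation() write.
    """
    if max_workers is None:
        # Default worker count described in this doc: min(8, cpu_count * 4)
        max_workers = min(8, (os.cpu_count() or 1) * 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; list() forces all uploads to finish
        return list(pool.map(save_one, annotations))
```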
validate_connection() runs the following steps inside a try/except block:
1. Get client
2. Get bucket
3. For source storage only: get any file from specified prefix
4. For target storage: there is no need to check the prefix, because it is created automatically when an annotation is written
Target storages use the same validate_connection() function, but without any prefix.
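The steps above can be sketched in provider-agnostic Python. The client interface here is a stand-in (hypothetical), not the real GCS SDK, and the exception type is invented for illustration:

```python
# Sketch of the validate_connection() steps described above.
# `client` is a hypothetical stand-in for a cloud SDK client.
class StorageValidationError(Exception):
    pass

def validate_connection(client, bucket_name, prefix=None, is_import=True):
    try:
        bucket = client.get_bucket(bucket_name)  # steps 1-2: client + bucket
    except Exception as exc:
        raise StorageValidationError(f'cannot access bucket {bucket_name!r}') from exc
    if is_import and prefix:
        # step 3: source storages must see at least one object under the prefix
        if not list(client.list_blobs(bucket, prefix=prefix, max_results=1)):
            raise StorageValidationError(f'no objects under prefix {prefix!r}')
    # step 4: target storages skip the prefix check; objects are created on write
    return True
```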
To summarize, the key methods are:

- iter_objects(), get_data() for source storages
- save_annotation(), save_annotations() for target storages
- validate_connection() for both

Export storages use _get_serialized_data(), which returns different formats based on feature flags:
- Default: Only annotation data (backward compatibility)
- With fflag_feat_optic_650_target_storage_task_format_long or FUTURE_SAVE_TASK_TO_STORAGE: full task + annotations data, instead of one annotation per file
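The two output shapes can be sketched as follows. The function and field names here are illustrative assumptions, not the real _get_serialized_data implementation:

```python
# Illustrative sketch of the two serialization formats; field names assumed.
def get_serialized_data(annotation, task, full_task_format=False):
    if full_task_format:
        # Flag-enabled format (fflag_feat_optic_650_... / FUTURE_SAVE_TASK_TO_STORAGE):
        # the whole task with all of its annotations in a single object
        return {
            'id': task['id'],
            'data': task['data'],
            'annotations': task['annotations'],
        }
    # Default: just the annotation, one per file (backward compatible)
    return annotation
```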
- save_annotations() has built-in parallel processing
- max_workers (default: min(8, cpu_count * 4))
- Keys are derived from annotation.id or task.id, plus an optional .json extension
- delete_annotation() is controlled by the can_delete_objects field

Storages (Import and Export) have different synchronization statuses (see class StorageInfo.Status):
```mermaid
graph TD;
    Initialized-->Queued;
    Queued-->InProgress;
    InProgress-->Failed;
    InProgress-->Completed;
```
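The lifecycle in the diagram can be expressed as a small transition table. The status names and helper are a sketch based on the diagram, not the actual StorageInfo.Status code:

```python
# Sketch of the StorageInfo.Status transitions shown in the diagram above.
# Status names are lowercased for illustration.
ALLOWED_TRANSITIONS = {
    'initialized': {'queued'},
    'queued': {'in_progress'},
    'in_progress': {'failed', 'completed'},
    # 'failed' and 'completed' are terminal in the diagram
}

def can_transition(current, new):
    """Return True if the diagram permits moving from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```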
Additionally, class StorageInfo contains counters and debug information that are displayed in the storage details:
All these states are present in both the open-source and enterprise editions for code compatibility. Status processing can be challenging, especially when the sync process is terminated unexpectedly. Typical situations when this happens include:
In each of these situations the job ends without calling storage_background_failure, so the storage never reaches the Failed status. To handle these cases correctly, all these conditions must be checked in ensure_storage_status when the Storage List API is retrieved.
The Storage Proxy API is a critical component that handles access to files stored in cloud storages (S3, GCS, Azure, etc.). It serves two main purposes:
Security & Access Control: It acts as a secure gateway to cloud storage resources, enforcing Label Studio's permission model and preventing direct exposure of cloud credentials to the client.
Flexible Content Delivery: It supports two modes of operation based on the storage configuration:
- presign=True: generates pre-signed URLs with temporary access and redirects the client to them. This is efficient, as content flows directly from the storage to the client.
- presign=False: streams content through the Label Studio server. This provides additional security and is useful when storage providers don't support pre-signed URLs or when administrators want to enforce stricter access control.

When tasks contain references to cloud storage URIs (e.g., s3://bucket/file.jpg), these are converted to proxy URLs (/tasks/{task_id}/resolve/?fileuri=base64encodeduri).
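Building such a proxy URL can be sketched as below. The exact base64 variant (standard vs URL-safe) used by Label Studio is an assumption here:

```python
import base64

def proxy_url(task_id, fileuri):
    """Sketch: convert a cloud URI into /tasks/{task_id}/resolve/?fileuri=...

    The URL-safe base64 alphabet is an assumption for illustration.
    """
    encoded = base64.urlsafe_b64encode(fileuri.encode()).decode()
    return f'/tasks/{task_id}/resolve/?fileuri={encoded}'
```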
When a client requests such a URL, the Proxy API checks the user's permissions and then either redirects to a pre-signed URL or streams the content, depending on the storage's presign setting. Two endpoints are exposed:

- /tasks/<task_id>/resolve/ - for resolving files referenced in tasks
- /projects/<project_id>/resolve/ - for resolving project-level resources

This architecture ensures secure, controlled access to cloud storage resources while maintaining flexibility for different deployment scenarios and security requirements.
The Proxy Mode has been optimized with several mechanisms to improve performance, reliability, and resource utilization:
The override_range_header function processes and intelligently modifies Range headers to limit stream sizes:
- Caps range sizes at a configurable maximum (RESOLVER_PROXY_MAX_RANGE_SIZE)
- Converts open-ended ranges (e.g., bytes=123456-) to bounded ones
- Applies the same bound when no explicit range is requested (bytes=0-)

Time-Limited Streaming
The time_limited_chunker generator provides controlled streaming with timeout protection:
- Stops streaming after a configurable timeout (RESOLVER_PROXY_TIMEOUT)
- Reads data in fixed-size chunks (RESOLVER_PROXY_BUFFER_SIZE) for efficient memory usage

Response Header Management
The prepare_headers function manages HTTP response headers for optimal client handling:
- Sets cache headers with a configurable TTL (RESOLVER_PROXY_CACHE_TIMEOUT)

The Storage Proxy API behavior can be configured using the following environment variables:
| Variable | Description | Default |
|---|---|---|
| RESOLVER_PROXY_BUFFER_SIZE | Size in bytes of each chunk when streaming data | 64*1024 |
| RESOLVER_PROXY_TIMEOUT | Maximum time in seconds a streaming connection can remain open | 10 |
| RESOLVER_PROXY_MAX_RANGE_SIZE | Maximum size in bytes for a single range request | 7*1024*1024 |
| RESOLVER_PROXY_CACHE_TIMEOUT | Cache TTL in seconds for proxy responses | 3600 |
These optimizations ensure that the Proxy API remains responsive and resource-efficient, even when handling large files or many concurrent requests.
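The time-limited chunking described above can be sketched as a generator. Defaults mirror RESOLVER_PROXY_TIMEOUT and RESOLVER_PROXY_BUFFER_SIZE from the table; this is an illustration of the mechanism, not the real time_limited_chunker:

```python
import time

def time_limited_chunker(stream, timeout=10, chunk_size=64 * 1024):
    """Yield fixed-size chunks until the stream ends or the deadline passes.

    Sketch only: defaults mirror RESOLVER_PROXY_TIMEOUT (10 s) and
    RESOLVER_PROXY_BUFFER_SIZE (64 KiB) from the configuration table.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            return
        yield chunk
    # Reaching here means the timeout elapsed; the stream is cut off.
```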
There are use cases where multiple storages can or must be used in a single project. This can cause some confusion as to which storage gets used when. Here are some common cases and how to set up multiple storages properly.