LightlyStudio: Reduce Annotation Costs Through Intelligent Data Curation

Published October 29 in Data Processing Tools

LightlyStudio streamlines data curation, labeling, and management for computer vision projects. It provides a high-performance Python interface powered by a Rust backend, enabling users to index, query, and slice massive datasets whether they are stored on local drives or in cloud environments like S3 and GCS. The platform natively supports standard formats including YOLO object detection, COCO instance segmentation, and COCO captions.

The primary advantage of LightlyStudio is its automated data selection. The tool identifies the subsets of data that offer the most value by finding samples that are both representative and diverse. Labeling only those subsets can lower annotation costs, shorten training cycles, and improve the quality of the resulting model.

Installation

LightlyStudio is compatible with Windows, Linux, and macOS. It requires Python 3.8 or newer.

pip install lightly-studio

Once installed, the environment is ready for use.
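
A quick way to confirm the setup is to check the interpreter version and whether the package can be located. The sketch below uses only the standard library. The helper name check_environment is illustrative and not part of LightlyStudio; the module name lightly_studio is taken from the import statements used later in this article.

```python
import importlib.util
import sys


def check_environment(module_name="lightly_studio", min_version=(3, 8)):
    """Return (python_ok, package_found) for the current interpreter."""
    # Tuple comparison against sys.version_info checks major and minor version
    python_ok = sys.version_info >= min_version
    # find_spec returns None when a top-level module cannot be located
    package_found = importlib.util.find_spec(module_name) is not None
    return python_ok, package_found


if __name__ == "__main__":
    python_ok, package_found = check_environment()
    print(f"Python >= 3.8: {python_ok}")
    print(f"lightly_studio importable: {package_found}")
```

If the second line reports False, the package was most likely installed into a different Python environment than the one running the script.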

Getting Started

To explore the platform's features, you can download the following example datasets:

git clone https://github.com/lightly-ai/dataset_examples dataset_examples

Below are examples of how to load various dataset types.

Image Folders

To manage a directory of raw images, create a file named example_image.py:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="dataset_examples/coco_subset_128_images/images")

ls.start_gui()

Running python example_image.py will launch the web interface at localhost:8001.

YOLO Object Detection

For YOLO-formatted datasets, create example_yolo.py:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_yolo(
    data_yaml="dataset_examples/road_signs_yolo/data.yaml",
)

ls.start_gui()

After running the script, images and their associated bounding boxes will appear in the application.
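
For context on what is being loaded: in the common YOLO label convention, each line of a label file holds a class index followed by a box center, width, and height, all normalized to the image size. The sketch below converts one such line to pixel corners. It is independent of LightlyStudio, and the function name is hypothetical.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert 'class cx cy w h' (normalized) to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    # Shift from center/size to corner coordinates, then scale to pixels
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return class_id, x_min, y_min, x_max, y_max


box = yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", image_width=640, image_height=480)
print(box)  # prints (0, 240.0, 120.0, 400.0, 360.0)
```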

COCO Instance Segmentation

For instance segmentation tasks, create example_coco.py:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_coco(
    annotations_json="dataset_examples/coco_subset_128_images/instances_train2017.json",
    images_path="dataset_examples/coco_subset_128_images/images",
    annotation_type=ls.AnnotationType.INSTANCE_SEGMENTATION,
)

ls.start_gui()

Upon execution, segmentation masks will be displayed alongside the images.

COCO Captions

For captioning datasets, create example_coco_captions.py:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_coco_caption(
    annotations_json="dataset_examples/coco_subset_128_images/captions_train2017.json",
    images_path="dataset_examples/coco_subset_128_images/images",
)

ls.start_gui()

The application will display the associated captions in the viewer.

Python API

The Python interface allows for programmatic indexing, querying, and manipulation of datasets.

The Dataset Object

The Dataset object is the central hub for your data. You can ingest samples from various sources at any time.

import lightly_studio as ls

dataset = ls.Dataset.create()

# Import from cloud storage
dataset.add_samples_from_path(path="s3://my-bucket/path/to/images/")

# Append data from additional sources
dataset.add_samples_from_path(path="gcs://my-bucket-2/path/to/more-images/")
dataset.add_samples_from_path(path="local-folder/some-data-not-in-the-cloud-yet")

# Load a previously saved database file
dataset = ls.Dataset.load()

The Sample Object

A sample represents an individual data point. You can retrieve or modify its attributes directly.

for sample in dataset:
    pass

samples = list(dataset)

s = samples[0]
s.sample_id        # UUID
s.file_name        # e.g., "img1.png"
s.file_path_abs    # Absolute file path
s.tags             # List of strings, e.g., ["tag1", "tag2"]
s.metadata["key"]  # Access specific metadata

# Modifications
s.tags = {"tag1", "tag2"}
s.metadata["key"] = 123
s.add_tag("some_tag")
s.remove_tag("some_tag")

Dataset Queries

Queries allow you to combine filtering, sorting, and slicing through logical expressions.

from lightly_studio.core.dataset_query import AND, OR, NOT, OrderByField, SampleField

# Identify samples that require labeling or are small and unreviewed
query = dataset.match(
    OR(
        AND(
            SampleField.width < 500,
            NOT(SampleField.tags.contains("reviewed"))
        ),
        SampleField.tags.contains("needs-labeling")
    )
)

# Sort results by width in descending order
query.order_by(OrderByField(SampleField.width).desc())

# Extract a specific slice
subset = query[10:20]

# Chained operations
query = dataset.match(...).order_by(...)[...]

# Apply actions to query results
query.add_tag("needs-review")

for sample in query:
    pass

samples_list = query.to_list()

# Export results for labeling in COCO format
query.export().to_coco_object_detections()
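
To make the boolean logic of the query above explicit, the same condition can be written over plain dictionaries. This only illustrates the semantics as described here, not how LightlyStudio evaluates queries internally; the record layout and the helper name matches are hypothetical.

```python
def matches(sample):
    """Mirror of OR(AND(width < 500, NOT(tags has "reviewed")), tags has "needs-labeling")."""
    small_and_unreviewed = sample["width"] < 500 and "reviewed" not in sample["tags"]
    return small_and_unreviewed or "needs-labeling" in sample["tags"]


# Hypothetical records standing in for dataset samples
samples = [
    {"file_name": "a.png", "width": 320, "tags": set()},
    {"file_name": "b.png", "width": 320, "tags": {"reviewed"}},
    {"file_name": "c.png", "width": 1024, "tags": {"needs-labeling"}},
    {"file_name": "d.png", "width": 1024, "tags": set()},
]

selected = [s for s in samples if matches(s)]
# Sort by width, widest first, then slice, following the order of steps in the query example
selected.sort(key=lambda s: s["width"], reverse=True)
print([s["file_name"] for s in selected[:2]])  # prints ['c.png', 'a.png']
```

Note that b.png is excluded because it is small but already reviewed, while c.png is included despite its width because it carries the needs-labeling tag.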

Automated Sample Selection

The core strength of LightlyStudio lies in its ability to isolate the most useful samples for labeling. You can balance two primary signals: typicality (representing common cases) and diversity (representing unique or edge cases).

from lightly_studio.selection.selection_config import (
    MetadataWeightingStrategy,
    EmbeddingDiversityStrategy,
)

# Calculate the typicality of each sample
dataset.compute_typicality_metadata(metadata_name="typicality")

# Select 10 samples by balancing typicality and diversity
dataset.query().selection().multi_strategies(
    n_samples_to_select=10,
    selection_result_tag_name="multi_strategy_selection",
    selection_strategies=[
        MetadataWeightingStrategy(metadata_key="typicality", strength=1.0),
        EmbeddingDiversityStrategy(embedding_model_name="my_model_name", strength=2.0),
    ],
)

By prioritizing the most informative data, you reduce manual labeling effort while still covering both the common cases and the edge cases the model needs to see.
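
The article does not describe how LightlyStudio combines the two signals internally. As a rough intuition only, the following generic greedy sketch scores each candidate by a weighted sum of a typicality term and a diversity term. The scoring formulas, function names, and toy embeddings are all assumptions made for illustration and are not the library's algorithm.

```python
import math


def typicality_scores(embeddings):
    """Score each point by closeness to the others; higher means more typical."""
    scores = []
    for i, point in enumerate(embeddings):
        others = [math.dist(point, q) for j, q in enumerate(embeddings) if j != i]
        mean_distance = sum(others) / len(others) if others else 0.0
        scores.append(1.0 / (1.0 + mean_distance))
    return scores


def select(embeddings, n_samples, typicality_strength=1.0, diversity_strength=2.0):
    """Greedily pick indices, trading typicality against distance to earlier picks."""
    typicality = typicality_scores(embeddings)
    selected = []
    while len(selected) < min(n_samples, len(embeddings)):
        best_index, best_score = None, -math.inf
        for i, point in enumerate(embeddings):
            if i in selected:
                continue
            # Diversity term: distance to the nearest already-selected point (0 for the first pick)
            diversity = min(
                (math.dist(point, embeddings[j]) for j in selected), default=0.0
            )
            score = typicality_strength * typicality[i] + diversity_strength * diversity
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
    return selected


# Toy 2-D embeddings: a tight cluster plus one outlier at index 3
embeddings = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(select(embeddings, n_samples=2))
```

In this toy setup the first pick comes from the dense cluster because nothing has been selected yet and only typicality counts; the second pick is the outlier because the diversity term then dominates. Raising the typicality weight relative to the diversity weight shifts later picks back toward the cluster.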
