Auto-Labeling and Model Training with CVAT

CVAT Auto-Annotation tools are designed to speed up the labeling process by delegating manual tasks—like drawing bounding boxes or segmenting masks—to pre-trained machine learning models. Instead of labeling every frame from scratch, these tools can be used to generate draft annotations that humans then review and refine.

The standard CVAT workflow includes:

  • Data sourcing (e.g. cloud storage).

  • Labeling (manual or AI assisted).

  • Refinement: review and correction of AI-generated labels to ensure high-quality ground truth data.

  • Export: annotated data are exported in industry-standard formats compatible with frameworks like TensorFlow, PyTorch, or OpenVINO (e.g., COCO JSON, Indexed PNGs).

  • Training & Evaluation: structured data are used to train custom models.

The next sections discuss specific examples in this context: AI-assisted labeling and exporting data for model training.

Automatic Annotation with cvat-cli

cvat-cli automates labeling by running local models and pushing results directly to CVAT tasks, making it easy to plug any external model into the annotation workflow.

This guide introduces the use of cvat-cli on UCloud, including important commands and a detailed walkthrough of running auto-annotation with built-in or custom models.

Create a task with cvat-cli

A CVAT task can be created in two ways: through the web interface or from the command line.

The web interface is more convenient for small-scale projects, such as creating a few tasks with manually defined labels or hand-picking specific local files and folders.

The command line is more efficient for large datasets: it is significantly faster for uploading large amounts of data and convenient for creating many tasks at once with scripts. In addition, it offers significant advantages when implementing auto-annotation workflows.

To get started, select CVAT among the UCloud applications (details on the app implementation and general use can be found here).

When the job starts, click the blue button Open interface and create an account. After that, open the job terminal by clicking Open terminal in the job progress view page.

A task named, for example, task_name, using local images in the /work path (/work/picture1.jpg, /work/picture2.jpg, and /work/picture3.jpeg), can be created by running the command:

$ cvat-cli --server-host http://localhost:8080 --auth cvat_username task create "task_name" --labels labels.json local picture1.jpg picture2.jpg picture3.jpeg

The required username (cvat_username) and password must match those of the CVAT account created when opening the interface.

A simple labels.json file, for defining the task labels, looks like this:

[
    {
        "name": "apple",
        "attributes": []
    },
    {
        "name": "orange",
        "attributes": []
    }
]
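For tasks with many classes, the file can also be generated programmatically; a minimal sketch using only the Python standard library (the class names here are examples):

```python
import json

# Example class names; replace with the labels needed by the task.
class_names = ["apple", "orange"]

# Each CVAT label needs at least a name; the attributes list can stay empty.
labels = [{"name": name, "attributes": []} for name in class_names]

with open("labels.json", "w") as f:
    json.dump(labels, f, indent=4)
```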

Several options can be added to the task creation command, e.g. for using pictures from remote storage or for specifying image quality (see the official CVAT CLI documentation).

Auto-annotate with built-in models

It is now possible to run a built-in model for detection. Below we use Faster R-CNN v2:

$ cvat-cli --server-host http://localhost:8080 --auth "cvat_username" task auto-annotate "task_id" --function-module cvat_sdk.auto_annotation.functions.torchvision_detection -p model_name=str:"fasterrcnn_resnet50_fpn_v2" -p score_thresh=float:0.5 --allow-unmatched-labels --clear-existing

Here:

  • --function-module specifies the Python module that implements the auto-annotation logic used by CVAT.

  • -p model_name=str:<model_identifier> specifies the exact architecture or pre-trained model variant to be used for the annotation task.

  • score_thresh indicates the value of the confidence threshold.

  • task_id is the ID of the task selected for annotation. The task ID can be read in the CVAT UI or by running the following command

    $ cvat-cli --server-host http://localhost:8080 --auth cvat_username task ls
    
  • --allow-unmatched-labels and --clear-existing are optional: the former tells the CLI to ignore labels produced by the model that do not exist in the target task, while the latter deletes all current annotations in the task before the auto-annotation process starts.
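Conceptually, the confidence threshold simply discards predictions scoring below it; an illustrative sketch (the predictions here are made up, not produced by CVAT):

```python
# Illustrative predictions as (label, confidence) pairs.
predictions = [("apple", 0.91), ("orange", 0.42), ("apple", 0.55)]

# Keep only predictions at or above the threshold, as score_thresh does.
score_thresh = 0.5
kept = [p for p in predictions if p[1] >= score_thresh]
print(kept)  # → [('apple', 0.91), ('apple', 0.55)]
```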

In this example fasterrcnn_resnet50_fpn_v2 is the name of the selected model. The supported torchvision models include:

  • fasterrcnn_resnet50_fpn_v2

  • fcos_resnet50_fpn

  • retinanet_resnet50_fpn

  • ssdlite320_mobilenet_v3_large

  • maskrcnn_resnet50_fpn

  • keypointrcnn_resnet50_fpn

Note

When creating labels in CVAT, either through the UI or by importing a labels.json file, the label names must exactly match those used by the model. For example, if the model expects a label named apple, the label in CVAT must also be called apple, not Apple or any other variation.
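A quick, case-sensitive comparison between the task labels and the model's class names can catch such mismatches before a run; a sketch (both label sets are examples):

```python
# Labels as defined in the CVAT task and as produced by the model (examples).
task_labels = {"apple", "Orange"}
model_labels = {"apple", "orange"}

# Model labels with no exact (case-sensitive) counterpart in the task.
unmatched = model_labels - task_labels
print(unmatched)  # → {'orange'}: 'Orange' does not match 'orange'
```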

After the auto-annotation is done, the results can be verified by opening the CVAT UI and selecting the task.

Dealing with RGBA images

If the selected images have an alpha channel (4 channels), errors like:

RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0

may appear. In this case it is necessary to patch the CVAT SDK to avoid errors. To do so, edit the following file

/home/django/.local/lib/python<version>/site-packages/cvat_sdk/auto_annotation/functions/torchvision_detection.py

by adding the line

image = image.convert("RGB")

after the line

conf_threshold = context.conf_threshold or 0
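The added line is the standard Pillow conversion that drops the alpha channel; a quick demonstration of its effect:

```python
from PIL import Image

# A small RGBA test image (4 channels, including alpha).
rgba = Image.new("RGBA", (8, 8), (255, 0, 0, 128))

# The patched line: drop the alpha channel so the model sees 3 channels.
rgb = rgba.convert("RGB")

print(rgba.getbands())  # ('R', 'G', 'B', 'A')
print(rgb.getbands())   # ('R', 'G', 'B')
```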

Using Nvidia GPU with built-in models

It is possible to run the built-in auto-annotation on GPUs instead of CPUs. To do this, follow the steps below.

  • Open the file _torchvision.py available in the path:

    /home/django/.local/lib/python<version>/site-packages/cvat_sdk/auto_annotation/functions/_torchvision.py
    

    Add this line:

    self._model.to("cuda")
    

    after the object's instance variable self._model definition, i.e.

    self._model = torchvision.models.get_model(model_name, weights=self._weights, **kwargs)
    self._model.to("cuda")
    self._model.eval()
    
  • Open the file torchvision_detection.py, available in the path:

    /home/django/.local/lib/python<version>/site-packages/cvat_sdk/auto_annotation/functions/torchvision_detection.py
    

    and replace the line

    results = self._model([self._transforms(image)])
    

    with this code

    device = next(self._model.parameters()).device
    results = self._model([self._transforms(image).to(device)])
    

Now the auto-annotation process, using the built-in models, will use the available GPU resources.

Adding custom models

CVAT supports auto-annotation using external models like YOLOv8 through its dedicated Auto-annotation API. The approach described here is also valid for other models. More info about the CVAT Auto-annotation API can be found here.

For custom models, it is necessary to first create an auto-annotation function. Here is an example using YOLOv8 to run auto-annotation:


"""
Filename: yolo-example-script.py
"""

from typing import List
import numpy as np
from PIL import Image
from ultralytics import YOLO
import cvat_sdk.auto_annotation as cvataa
import cvat_sdk.models as models


class _Yolov8AA:
    def __init__(self, weights_path: str, conf: float = 0.25, iou: float = 0.7, device: str = "cpu"):
        self.model = YOLO(weights_path)
        self.model.to(device)
        self.model.conf = conf
        self.model.iou = iou

        # Build CVAT label spec directly from model class names
        self._labels = list(self.model.names.values())
        self._spec = cvataa.DetectionFunctionSpec(
            labels=[cvataa.label_spec(name, idx, type="rectangle")
                    for idx, name in enumerate(self._labels)]
        )

    @property
    def spec(self) -> cvataa.DetectionFunctionSpec:
        return self._spec

    def detect(self, context: cvataa.DetectionFunctionContext,
               image: Image.Image) -> List[models.LabeledShapeRequest]:
        """Run YOLO on one frame and convert predictions to CVAT shapes."""

        # Convert RGBA to RGB
        if image.mode != "RGB":
            image = image.convert("RGB")

        thr = context.conf_threshold or self.model.conf
        preds = self.model(np.asarray(image))[0]

        shapes = []
        for box, cls, score in zip(preds.boxes.xyxy.cpu(),
                                   preds.boxes.cls.cpu(),
                                   preds.boxes.conf.cpu()):
            if score < thr:
                continue
            x1, y1, x2, y2 = map(float, box)
            shapes.append(cvataa.rectangle(int(cls), [x1, y1, x2, y2]))
        return shapes

def create(weights_path: str, conf: float = 0.25, iou: float = 0.7):
    """Factory called by cvat-cli when you pass --function-file."""
    return _Yolov8AA(weights_path, conf, iou)

The command to run the auto-annotation is then:

$ cvat-cli --server-host http://localhost:8080 --auth cvat_username task auto-annotate task_id --function-file path_to_yolo-example-script -p weights_path=str:yolov8m.pt -p conf=float:0.30 --allow-unmatched-labels --clear-existing

The command line is very similar to the one used for the built-in auto-annotation models. The only difference is the --function-file option, which specifies the path to the custom execution script. In addition, it is necessary to define weights_path, which in the example above is set to the YOLOv8m model ID. If not already present, the weights file is downloaded automatically on the first run. Alternatively, it is possible to provide a path to a local custom weights file or use other YOLO variants:

  • yolov8n.pt (nano) - fastest, smallest

  • yolov8s.pt (small) - good balance

  • yolov8m.pt (medium) - better accuracy

  • yolov8l.pt (large) - high accuracy

  • yolov8x.pt (extra large) - best accuracy, slowest

To run auto-annotation with YOLO models on GPUs, the script must be changed accordingly. In the example script above, the string cpu in this line

def __init__(self, weights_path: str, conf: float = 0.25, iou: float = 0.7, device: str = "cpu"):

must be changed to cuda, i.e.

def __init__(self, weights_path: str, conf: float = 0.25, iou: float = 0.7, device: str = "cuda"):

Note

In general, CVAT is flexible regarding the specific model to use. The key step is ensuring that the model's annotations are converted into a format CVAT recognizes. Additionally, both the built-in and the YOLO models are trained on the COCO dataset; more info can be found here.

YOLOv8 Training with CVAT Annotations

Training a YOLOv8 model using custom CVAT annotations is essentially a fine-tuning process. The workflow described below is a standard method used to accelerate data labeling by using a small manually annotated seed dataset to train a model that can then be used to auto-annotate the rest of the data.

The steps include:

  • Manual annotation (or refinement after auto-annotation) of a subset of the data.

  • Export and validation setup: export annotations and manually create a val (validation) folder.

  • Model training: train a custom model using this small labeled set.

  • Full dataset auto-annotation: once trained, it is possible to automatically annotate the remaining images (the full dataset) via cvat-cli.

Export annotations

Precisely annotate a subset of the data (e.g. manually) and run the following command to export the annotations for model training:

$ cvat-cli --server-host http://localhost:8080 --auth cvat_username task export-dataset --format "Ultralytics YOLO Detection 1.0" --with-images yes "task_id" outputName.zip

Here, --format "Ultralytics YOLO Detection 1.0" specifies the output format, which uses a .txt label file for each image containing normalized bounding box coordinates. Selecting the correct format is essential for the specific workflow, as the dataset must follow a strict directory structure to be compatible with model training.
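Concretely, each line in such a .txt file has the form <class_id> <x_center> <y_center> <width> <height>, with all coordinates normalized by the image dimensions. A sketch of the conversion from a pixel-space box (the helper name is illustrative, not part of CVAT):

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to a YOLO label line."""
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A 100x50 box with its top-left corner at (200, 100) in a 640x480 image:
print(to_yolo_line(0, 200, 100, 300, 150, 640, 480))
# → 0 0.390625 0.260417 0.156250 0.104167
```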

CVAT supports a wide variety of dataset formats beyond Ultralytics YOLO (see here for details).

In order to use the exported data, the archive needs to be unzipped:

$ unzip outputName.zip -d outputName

The output folder has a structure similar to this:

outputName/
├── data.yaml
├── images
│   └── train
├── labels
│   └── train

Here the images folder contains the raw visual data, the labels folder contains the annotations that describe where objects are in each image and the data.yaml file is the configuration file.

Within the folder, the outputName/images/train and outputName/labels/train directories must not contain subfolders. If multiple source folders were used, move all files directly into the root of their respective directories.
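If the export produced nested subfolders, they can be flattened with a few lines (a sketch, assuming filenames are unique across subfolders; run once per train directory):

```python
import shutil
from pathlib import Path

train_dir = Path("outputName/images/train")
train_dir.mkdir(parents=True, exist_ok=True)

# Move every file found in a subfolder up to the train root,
# then remove the emptied subfolders.
for sub in [d for d in train_dir.iterdir() if d.is_dir()]:
    for f in sub.rglob("*"):
        if f.is_file():
            shutil.move(str(f), str(train_dir / f.name))
    shutil.rmtree(sub)
```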

Train the model

Before training the custom model, create the outputName/images/val and outputName/labels/val directories. Move a portion of the images (usually 10-20%) and their corresponding labels into these folders to serve as the validation set.
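The split can be done by hand or scripted; a minimal sketch that moves roughly 20% of the images, together with their matching label files, into the val directories (paths follow the structure above):

```python
import random
import shutil
from pathlib import Path

root = Path("outputName")
img_train, lbl_train = root / "images" / "train", root / "labels" / "train"
img_val, lbl_val = root / "images" / "val", root / "labels" / "val"
img_val.mkdir(parents=True, exist_ok=True)
lbl_val.mkdir(parents=True, exist_ok=True)

images = sorted(p for p in img_train.glob("*") if p.is_file())
random.seed(0)  # reproducible split
n_val = max(1, len(images) // 5) if images else 0  # ~20% of the images

for img in random.sample(images, n_val):
    shutil.move(str(img), str(img_val / img.name))
    label = lbl_train / (img.stem + ".txt")
    if label.exists():  # images without annotations have no label file
        shutil.move(str(label), str(lbl_val / label.name))
```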

The auto-generated data.yaml file may require manual updates to ensure that the paths to the training and validation data are correctly defined, e.g.:

path: /absolute/path/to/outputName
train: images/train
val: images/val
names:
  0: WhiteRook
  1: BlackRook

To fine-tune a YOLOv8 object detection model on a custom dataset using the Ultralytics Command Line Interface (CLI), write:

$ yolo task=detect mode=train model=yolov8n.pt data=outputName/data.yaml epochs=50

During training, the model iteratively corrects its prediction errors against the annotated images and saves the best-performing weights for later use.

Use the trained model

The best model can be found in the following folder:

runs/detect/train/weights/best.pt

The file best.pt can be used in place of yolov8m.pt with cvat-cli by passing the path to the newly created weights file:

$ cvat-cli --server-host http://localhost:8080 --auth cvat_username task auto-annotate "task_id" --function-file path-to-yolo-script -p weights_path=str:runs/detect/trainX/weights/best.pt -p conf=float:0.30 --allow-unmatched-labels --clear-existing

Here, path-to-yolo-script refers to the Python script that contains the custom auto-annotation (AA) function logic.