What is the YOLO annotation format?

The YOLO annotation format is a lightweight text representation for object detection and related tasks in which each image has a companion .txt file with one row per object. Variants exist for oriented boxes, segmentation, and pose. The format is also referred to as YOLO labels or the YOLO txt format.

A standard detection label row contains five whitespace-separated values: class id, x_center, y_center, width, and height. All coordinates are normalized to the image size: x_center and width are divided by the image width, y_center and height by the image height, so values range from 0 to 1. For example, if x_center is 0.5, the object center is at half the image width. An image without objects has no label file. Widely used implementations document these conventions and provide exporters and validators.
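The row layout described above can be illustrated with a short Python sketch. It assumes a plain five-value detection row; the names `YoloBox`, `parse_row`, and `to_pixel_corners` are hypothetical helpers chosen for illustration, not part of any library.

```python
from typing import NamedTuple


class YoloBox(NamedTuple):
    """One detection label row (hypothetical container for illustration)."""
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_row(row: str) -> YoloBox:
    """Parse 'class_id x_center y_center width height' into a YoloBox."""
    parts = row.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 values, got {len(parts)}: {row!r}")
    return YoloBox(int(parts[0]), *(float(p) for p in parts[1:]))


def to_pixel_corners(box: YoloBox, img_w: int, img_h: int) -> tuple[float, float, float, float]:
    """Convert normalized center/size to pixel (x_min, y_min, x_max, y_max)."""
    x_min = (box.x_center - box.width / 2) * img_w
    y_min = (box.y_center - box.height / 2) * img_h
    x_max = (box.x_center + box.width / 2) * img_w
    y_max = (box.y_center + box.height / 2) * img_h
    return x_min, y_min, x_max, y_max


# A box centered in a 640x480 image, a quarter of its width and half its height.
box = parse_row("0 0.5 0.5 0.25 0.5")
corners = to_pixel_corners(box, img_w=640, img_h=480)
```

Multiplying by the image dimensions is the inverse of the normalization step, which is why the same label file stays valid if the image is resized without cropping.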

Practical tips for teams:

  1. Keep a stable class-id mapping file in the repository and log its commit hash with each export.
  2. Validate that xywh values are within 0–1 and that widths and heights are positive.
  3. For videos, decide whether to store per-frame YOLO labels or convert to track-based schemas.
  4. When converting from VOC or COCO, unit-test a sample of images to catch off-by-one or normalization errors.
  5. Distinguish between ground-truth labels and model predictions. Prediction files may add a confidence value after xywh, but training labels usually omit it.
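Tips 2 and 5 could be checked with a row-level validator along the following lines. This is a minimal sketch: `validate_row`, its `num_classes` parameter, and the `is_prediction` flag are hypothetical names, and it assumes that a prediction row carries exactly one extra confidence value after xywh, in the 0–1 range.

```python
def validate_row(row: str, num_classes: int, is_prediction: bool = False) -> list[str]:
    """Return a list of problems found in one row (an empty list means OK)."""
    # Assumption: training labels have 5 values, prediction rows have 6.
    expected = 6 if is_prediction else 5
    parts = row.split()
    if len(parts) != expected:
        return [f"expected {expected} values, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        values = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric value"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} not in mapping")
    x, y, w, h = values[:4]
    for name, v in (("x_center", x), ("y_center", y), ("width", w), ("height", h)):
        if not 0.0 <= v <= 1.0:
            problems.append(f"{name}={v} outside 0-1")
    if w <= 0 or h <= 0:
        problems.append("width and height must be positive")
    if is_prediction and not 0.0 <= values[4] <= 1.0:
        problems.append(f"confidence={values[4]} outside 0-1")
    return problems
```

Returning a list of messages instead of raising on the first error lets an export job report every problem in a file in one pass.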

Teams often prefer the YOLO format for rapid iteration because it is compact, human-readable, and integrates with popular training libraries. Conversions from VOC, COCO, or proprietary schemas are common and should retain class mappings faithfully. For complex scenes or small objects, IoU-based audits that compare each exported box with its source annotation can catch subtle errors introduced during export from annotation tools. CVAT’s docs also outline YOLO families for detection, oriented boxes, segmentation, and pose, which many teams adopt as they scale beyond boxes.
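As an illustration of a conversion plus an IoU-based audit, the sketch below converts a COCO-style box (pixel `[x_min, y_min, width, height]`) to normalized YOLO values and then checks an exported row against its source. The function names and the `min_iou=0.99` threshold are assumptions made for this example, not values taken from any tool.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """COCO [x_min, y_min, width, height] in pixels -> normalized (xc, yc, w, h)."""
    x_min, y_min, w, h = bbox
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_corners(xywh, img_w, img_h):
    """Normalized (xc, yc, w, h) -> pixel (x_min, y_min, x_max, y_max)."""
    xc, yc, w, h = xywh
    return ((xc - w / 2) * img_w, (yc - h / 2) * img_h,
            (xc + w / 2) * img_w, (yc + h / 2) * img_h)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def audit(coco_bbox, yolo_xywh, img_w, img_h, min_iou=0.99):
    """True if the exported YOLO box overlaps the source COCO box closely enough."""
    x, y, w, h = coco_bbox
    source = (x, y, x + w, y + h)
    return iou(source, yolo_to_corners(yolo_xywh, img_w, img_h)) >= min_iou


source_box = [100, 50, 200, 100]          # pixels, in a 640x480 image
good = coco_to_yolo(source_box, 640, 480)
# A typical export bug: the top-left corner is written where the center belongs.
bad = (100 / 640, 50 / 480, 200 / 640, 100 / 480)
```

Here `audit(source_box, good, 640, 480)` passes while `audit(source_box, bad, 640, 480)` fails, because the corner-as-center mistake shifts the box by half its size. For small objects, even a one-pixel shift lowers IoU noticeably, so a strict threshold is a reasonable starting point.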

Example:

A warehouse team annotates forklift and pallet positions for safety analytics. Each frame’s .txt file lists one row per object with normalized xywh values and class ids. A nightly job validates the ranges and flags frames with boxes that spill outside 0–1 bounds for review.
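The nightly check in this example might look like the following sketch. It assumes one .txt label file per frame in a single directory; `box_spills`, `flag_frames`, and the tolerance `tol` are hypothetical names and values introduced for illustration.

```python
from pathlib import Path


def box_spills(x_center, y_center, width, height, tol=1e-6):
    """True if any box edge falls outside the normalized 0-1 image bounds."""
    return (
        x_center - width / 2 < -tol
        or y_center - height / 2 < -tol
        or x_center + width / 2 > 1 + tol
        or y_center + height / 2 > 1 + tol
    )


def flag_frames(label_dir):
    """Return sorted names of label files with at least one spilling box."""
    flagged = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        for line in path.read_text().splitlines():
            parts = line.split()
            if len(parts) < 5:
                continue  # blank or short rows are left to a separate format check
            x, y, w, h = map(float, parts[1:5])
            if box_spills(x, y, w, h):
                flagged.append(path.name)
                break  # one bad box is enough to send the frame to review
    return flagged
```

Checking the box edges instead of only the raw xywh values catches boxes whose center and size are each within 0–1 but which still extend past the image border, such as a box centered at 0.95 with a width of 0.2.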