
What is bounding box annotation?

Bounding box annotation is the process of drawing rectangles around objects in images or videos. This process helps models learn to localize and classify objects. Bounding boxes are the most widely used labeling primitive for object detection. They are fast to create, compact to store, and supported by all major training libraries. You may also see terms like "box labeling," "object boxing," or just "bboxes."

A bounding box is usually defined by its top-left corner (x, y) plus its width and height. Some formats instead use two corners: (xmin, ymin, xmax, ymax). Certain formats normalize coordinates to 0–1, while others store absolute pixel values. The label typically includes a class (e.g., "car") and may include attributes (e.g., "occluded: true", "truncated: 30%"). For production datasets, teams often log provenance fields such as image ID, annotator ID, timestamp, and ontology version. This supports traceability for audits and for diagnosing model regressions.
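As a minimal sketch of how such a record might look in code (the field names are illustrative, not from any specific tool), one could write:

```python
from dataclasses import dataclass, field

@dataclass
class BoxLabel:
    # Geometry: top-left corner plus size, in absolute pixels.
    x: float
    y: float
    width: float
    height: float
    # Semantics.
    label: str                                      # e.g., "car"
    attributes: dict = field(default_factory=dict)  # e.g., {"occluded": True}
    # Provenance for audits and regression tracing.
    image_id: str = ""
    annotator_id: str = ""
    timestamp: str = ""                             # ISO 8601
    ontology_version: str = ""

    def to_corners(self):
        """Convert to the two-corner convention (xmin, ymin, xmax, ymax)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

box = BoxLabel(x=40, y=60, width=120, height=80, label="car",
               attributes={"occluded": True, "truncated": 0.3},
               image_id="img_0042", annotator_id="a17",
               timestamp="2024-05-01T12:00:00Z", ontology_version="v3")
print(box.to_corners())  # (40, 60, 160, 140)
```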

Annotation formats depend on the ecosystem. COCO stores boxes as [x, y, width, height] in pixels, along with category and instance IDs. Pascal VOC uses XML with xmin/ymin/xmax/ymax. The YOLO family expects one text file per image, with one line per object: the class ID followed by the normalized x_center, y_center, width, and height. Converters are common, but each export should preserve the class mapping and coordinate convention to avoid subtle errors such as off-by-one shifts or normalization mistakes. For rotated objects (shipping labels, aerial scenes), teams may use oriented boxes or polygons. If you must use axis-aligned boxes, write guidelines for consistent padding around rotated items.
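The coordinate conventions above are standard; the helper below is a sketch of a COCO-to-YOLO conversion (the function name and argument layout are ours):

```python
def coco_to_yolo(box, img_w, img_h):
    """Convert a COCO box [x, y, width, height] in pixels to the
    YOLO convention: normalized (x_center, y_center, width, height)."""
    x, y, w, h = box
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return (x_center, y_center, w / img_w, h / img_h)

# A 100x50 box at (200, 150) in a 640x480 image:
print(coco_to_yolo([200, 150, 100, 50], 640, 480))
# -> (0.390625, 0.364583..., 0.15625, 0.104166...)
```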

Clear guidelines remove ambiguity and speed up throughput. Decide how "tight" a box should be; for example, whether a 2–3 pixel margin is allowed. Specify how to handle occlusions: label the visible part only, or approximate the full extent. Define how to treat truncation at image edges. Crowded scenes (retail shelves, traffic) require separate boxes per instance. State clearly when to ignore tiny or heavily blurred objects. In video, choose between single-frame boxes and track-level IDs for an object across frames. Tracking helps with action analytics, re-identification, and multi-camera stitching.
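To make such decisions machine-checkable, some teams encode them as a config. The sketch below is purely illustrative; every key name and value is an assumption, not a standard schema:

```python
# Hypothetical guideline config; keys and values are illustrative only.
GUIDELINES = {
    "box_margin_px": (2, 3),             # allowed slack around the object edge
    "occlusion_policy": "visible_only",  # vs. "amodal" (full-extent) boxes
    "truncation_policy": "clip_to_edge",
    "min_object_px": 12,                 # ignore objects smaller than this
    "max_blur_score": 0.8,               # ignore objects blurrier than this
    "video_mode": "track_ids",           # vs. "single_frame"
}

def should_label(obj_w_px, obj_h_px, blur_score, cfg=GUIDELINES):
    """Apply the ignore rules from the guideline config."""
    big_enough = min(obj_w_px, obj_h_px) >= cfg["min_object_px"]
    sharp_enough = blur_score <= cfg["max_blur_score"]
    return big_enough and sharp_enough
```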

Quality assurance uses preventive and measurement controls. Preventive steps include ontology examples, near-misses, and calibration rounds before launch. Measurement often uses IoU (intersection-over-union) between a worker’s box and a gold/reference box at thresholds such as 0.5 or 0.75. Stricter thresholds apply to small objects. Reviewers also check attribute consistency, class confusions, and systematic bias, such as better labeling of large objects. Model-assisted labeling can pre-draw boxes. Low-confidence or low-IoU predictions go to senior reviewers, while high-confidence cases get lighter checks. Error analysis should inform the ontology and sampling so edge cases receive attention.
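IoU itself is simple to compute. Here is a minimal sketch for axis-aligned boxes in (xmin, ymin, xmax, ymax) form:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A worker box vs. a gold box, checked at a 0.5 threshold.
worker = (10, 10, 110, 110)
gold = (20, 20, 120, 120)
print(iou(worker, gold))          # ~0.68
print(iou(worker, gold) >= 0.5)   # True: passes QA at this threshold
```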

Example:

A warehouse safety team labels forklifts and pallets in CCTV footage. The ontology defines "forklift" and "pallet" with a "visible operator" attribute. Boxes must be tight, with a 2-pixel margin. For occlusions, workers label only the visible part; if a forklift is split by shelving, each visible part gets its own box. In QA, require IoU ≥ 0.6 for "forklift" and ≥ 0.5 for "pallet" due to frequent partial views. A small model pre-labels frames, and entropy-based routing sends uncertain detections for human review (a sketch follows).
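One way to implement that routing, assuming the pre-labeling model emits per-class probabilities (the threshold value here is arbitrary):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route(detection, threshold=0.8):
    """Send high-entropy (uncertain) detections to human review."""
    return "human_review" if entropy(detection["probs"]) > threshold else "auto_accept"

# A confident forklift vs. an ambiguous forklift/pallet detection.
print(route({"probs": [0.97, 0.02, 0.01]}))  # auto_accept  (~0.22 bits)
print(route({"probs": [0.50, 0.45, 0.05]}))  # human_review (~1.23 bits)
```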

Move beyond boxes when pixel-level shape matters, as with defect detection or medical contours; use instance or semantic segmentation in these cases. If internal structure matters (for example, human pose or hand landmarks), use keypoint annotation. When text orientation is critical (documents, street signs), oriented boxes or polygons help reduce label noise. Many programs start with boxes for speed, then switch to richer geometries after discovering failure modes.