What is image annotation?

Image annotation is the process of labeling images so that computer vision models can learn to recognize objects, regions, text, or actions. It turns raw pixels into training data by attaching human or model-generated labels to pixels, shapes, or entire images. Teams use annotation to build datasets for object detection, instance and semantic segmentation, keypoint and landmark detection, OCR, classification, and visual search.

Annotation can be manual, model-assisted, or automated. In model-assisted workflows, a pre-trained model proposes labels that humans then correct, an approach known as human-in-the-loop, which increases throughput while preserving accuracy. Active learning is often used to pick the most informative images for review so the model improves quickly. Outputs are saved in common formats such as COCO JSON, Pascal VOC XML, YOLO TXT, or segmentation masks stored as image files. In medical imaging, annotations may be saved as 3D label maps aligned to NIfTI volumes or as DICOM overlays.
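
These plain-text formats mostly differ in how a bounding box is written down. A minimal sketch of the conversions, assuming a hypothetical 800x1200 product photo and illustrative pixel values:

```python
# Convert one bounding box between common annotation formats.
# Image size and box values are illustrative assumptions.

def coco_to_voc(bbox):
    """COCO [x, y, width, height] -> Pascal VOC [xmin, ymin, xmax, ymax]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def coco_to_yolo(bbox, img_w, img_h):
    """COCO [x, y, width, height] -> YOLO [x_center, y_center, w, h], normalized to 0-1."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

coco_box = [315, 220, 180, 420]           # x, y, width, height in pixels
print(coco_to_voc(coco_box))              # [315, 220, 495, 640]
print(coco_to_yolo(coco_box, 800, 1200))  # normalized [x_center, y_center, w, h]
```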

Quality assurance is crucial. Good projects start with a clear label ontology and detailed guidelines, then measure quality with inter-annotator agreement, gold-standard checks, sampling-based review, and feedback loops. Attributes like occlusion, truncation, and confidence scores are often captured to make datasets more useful for training and evaluation.
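
For box annotations, a simple way to quantify inter-annotator agreement is to match two annotators' boxes by IoU and report the fraction that overlap above a threshold. A rough sketch, with illustrative boxes and a commonly used 0.5 threshold:

```python
def iou(a, b):
    """IoU of two boxes in [xmin, ymin, xmax, ymax] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def agreement(boxes_a, boxes_b, thresh=0.5):
    """Fraction of annotator A's boxes that annotator B also drew (IoU >= thresh)."""
    matched = sum(any(iou(a, b) >= thresh for b in boxes_b) for a in boxes_a)
    return matched / len(boxes_a) if boxes_a else 1.0

annotator_1 = [[315, 220, 495, 640], [410, 500, 530, 640]]  # illustrative boxes
annotator_2 = [[320, 225, 490, 635]]
print(agreement(annotator_1, annotator_2))  # 0.5 -> flag this image for review
```

Images with low agreement are good candidates for a second review pass or a guideline update.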

Common annotation types used in machine learning:

  1. Bounding boxes for object detection.
  2. Polygons and per-pixel masks for instance or semantic segmentation.
  3. Polylines for lanes, boundaries, and roads.
  4. Keypoints and skeletons for pose or landmark detection.
  5. Tags and attributes at the image or object level.
  6. Text regions and transcripts for OCR.

Example

Scenario: You’re building a product-detection model for a fashion e-commerce catalog. Each product photo shows a model wearing multiple items.

What annotators do:

  1. Draw bounding boxes around each item (dress, belt, heels, handbag).
  2. Add labels and attributes (category = “dress”, color = “red”, pattern = “solid”, sleeve = “sleeveless”).
  3. For tricky outlines (e.g., a handbag strap), use a polygon/per-pixel mask so the cutout is clean.
  4. Save annotations to COCO JSON (or YOLO TXT/Pascal VOC XML).

What the annotation looks like (conceptually):

  1. image_id: 1023
  2. annotations:
    • bbox: [315, 220, 180, 420], category: "dress", attributes: {color: "red", sleeve: "sleeveless"}
    • bbox: [410, 500, 120, 140], category: "heels", attributes: {color: "black"}
    • polygon: [...], category: "handbag", attributes: {material: "leather"}
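
In actual COCO JSON, the same record might look roughly like the sketch below. IDs, the file name, the handbag box, and the polygon points are placeholders, and the attributes field is a common custom extension rather than part of the core COCO spec:

```python
import json

# A minimal COCO-style file for the catalog example above (values are illustrative).
coco = {
    "images": [{"id": 1023, "file_name": "look_1023.jpg", "width": 800, "height": 1200}],
    "categories": [
        {"id": 1, "name": "dress"},
        {"id": 2, "name": "heels"},
        {"id": 3, "name": "handbag"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1023, "category_id": 1,
         "bbox": [315, 220, 180, 420],  # COCO boxes are [x, y, width, height]
         "area": 180 * 420, "iscrowd": 0,
         "attributes": {"color": "red", "sleeve": "sleeveless"}},
        {"id": 2, "image_id": 1023, "category_id": 2,
         "bbox": [410, 500, 120, 140], "area": 120 * 140, "iscrowd": 0,
         "attributes": {"color": "black"}},
        {"id": 3, "image_id": 1023, "category_id": 3,
         "bbox": [150, 300, 90, 160], "area": 90 * 160, "iscrowd": 0,
         # placeholder polygon outline: [x1, y1, x2, y2, ...]
         "segmentation": [[150, 300, 240, 300, 240, 460, 150, 460]],
         "attributes": {"material": "leather"}},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```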

Why this helps:

  1. Train a model to auto-detect and crop products for product detail and listing page (PDP/PLP) images.
  2. Power visual search and attribute filters (“red dress”, “black heels”).
  3. Evaluate quality with mAP/IoU and audit human work via inter-annotator agreement.
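
If the ground truth and model detections are both in COCO format, mAP can be computed with pycocotools. A rough sketch, assuming the file names "annotations.json" (as written above) and "detections.json":

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model detections, both in COCO format.
coco_gt = COCO("annotations.json")
coco_dt = coco_gt.loadRes("detections.json")  # list of {image_id, category_id, bbox, score}

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR metrics, including mAP averaged over IoU 0.5:0.95
```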