The COCO (Common Objects in Context) dataset is a large-scale benchmark for computer vision tasks including object detection, instance segmentation, person keypoints, and image captioning.
It contains ~330k images (about 200k annotated) with 80 “thing” categories and over 1.5M labeled object instances, making it a standard resource for training and evaluation.
COCO’s impact comes from three ingredients. First, rich annotations: each object carries a category id, a bounding box, and, unlike older detection datasets, a polygon mask tracing its boundary; person instances additionally carry 17 body keypoints for pose estimation. Second, “in context” scenes: everyday images feature clutter, occlusion, and wide scale variation, forcing models to generalize beyond clean, centered shots. Third, standard metrics: mean Average Precision (mAP), averaged over IoU thresholds from 0.50 to 0.95, and related measures define a common yardstick across tasks and challenges.
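For those metrics, most teams rely on the reference implementation in pycocotools. A minimal evaluation sketch, assuming a ground-truth instances file and a detections file exported in the COCO results format (both file names below are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")      # placeholder path: ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")  # placeholder path: detections in COCO results format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")  # "segm" and "keypoints" are also supported
evaluator.evaluate()    # match detections to ground truth per image and category
evaluator.accumulate()  # aggregate precision/recall across thresholds
evaluator.summarize()   # print AP/AR, including AP@[.50:.95]
```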
A typical COCO object-detection annotation file (JSON) holds dataset-level info, licenses, an images list (file_name, width, height, id), an annotations list (id, image_id, category_id, bbox as [x, y, width, height], segmentation, area, iscrowd), and a categories list mapping category ids to names. Instance segmentation stores polygons or run-length-encoded (RLE) masks; keypoint annotations use ordered landmark arrays with per-point visibility flags. Many libraries provide loaders, visualizers, and converters for this format.
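The format is easiest to see through the reference pycocotools loader. A small sketch, assuming a local copy of an instances file (the path and category name below are placeholders):

```python
from pycocotools.coco import COCO

coco = COCO("instances_val2017.json")  # placeholder path

# Look up a category, then the images and annotations that contain it.
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)
image_info = coco.loadImgs(img_ids[0])[0]  # dict with file_name, width, height, id
ann_ids = coco.getAnnIds(imgIds=image_info["id"], catIds=cat_ids, iscrowd=None)
annotations = coco.loadAnns(ann_ids)       # each has bbox, segmentation, iscrowd, ...

for ann in annotations:
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(ann["category_id"], (x, y, w, h))
```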
How teams use COCO in practice:
- Transfer learning: start from COCO-pretrained detection or segmentation models and fine-tune on project data (a sketch follows this list).
- Benchmarking: report COCO-style AP on the standard splits so results are comparable with published baselines.
- Format reuse: export custom annotations as COCO JSON so existing loaders, visualizers, and evaluation tools work unchanged.
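As a concrete example of the first item, here is a minimal transfer-learning sketch using torchvision, one common choice rather than the only one; the class count is a hypothetical value for a project-specific label set:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 5  # hypothetical: 4 project classes + 1 background class

# COCO-pretrained Faster R-CNN (torchvision >= 0.13 weights API)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the COCO box-prediction head for one sized to the custom label set.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
# From here, train as usual on a COCO-format dataset of your own.
```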
Strengths and limits
COCO offers breadth across common household and street objects, strong baselines, and reproducible metrics—ideal for detection, segmentation, and pose method development.
But it is not a domain dataset: medical, industrial, aerial, or retail SKU tasks still require custom ontologies and fresh labels. Teams often bootstrap with COCO-pretrained models, then rely on model-assisted labeling and active learning to curate high-signal, project-specific data.
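A minimal sketch of that bootstrapping step, assuming torchvision and a placeholder image path; the 0.5 score threshold is an arbitrary illustration of filtering draft boxes before human review:

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# COCO-pretrained detector used only to produce draft labels.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = convert_image_dtype(read_image("unlabeled/000001.jpg"), torch.float)  # placeholder path
with torch.no_grad():
    pred = model([img])[0]  # dict with "boxes", "labels", "scores"

keep = pred["scores"] > 0.5          # assumed confidence cutoff
draft_boxes = pred["boxes"][keep]    # xyxy boxes to seed a labeling tool
draft_labels = pred["labels"][keep]  # COCO category ids, to be remapped/reviewed
```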