Inter-annotator agreement (IAA) tells you how consistently different people apply the same labeling rules. High agreement signals clear guidelines and reliable labels; low agreement usually means the task is ambiguous, the schema is unclear, or the examples don’t reflect reality. Because models learn from these labels, IAA acts like a ceiling—if people can’t agree, your model won’t either.
Why it matters in production: teams use IAA to accept or reject batches from vendors, compare labeling tools or workflows, and decide where to invest in guideline rewrites or extra training. It also protects evaluation: a test set with poor IAA produces noisy metrics (precision, recall, F1 score) and leads to misleading model choices.
How it’s measured depends on the task. For binary or multi-class tagging with two reviewers, percent agreement is easy to grasp but can be inflated by class imbalance; Cohen’s kappa adjusts for chance agreement and is a better default. With more than two reviewers or with missing labels, Krippendorff’s alpha generalizes well and supports nominal, ordinal, and continuous labels. For dense computer-vision tasks, such as segmentation on medical images (DICOM annotation), teams often compute IoU or Dice between masks and then set a threshold to decide whether two annotations “agree”.
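As a rough illustration, the sketch below computes percent agreement and Cohen’s kappa for two reviewers with scikit-learn, then IoU and Dice between two segmentation masks with NumPy. The labels, masks, and the 0.7 threshold are invented placeholders, not values from any particular project.

```python
# Minimal sketch: pairwise agreement metrics for two reviewers.
# Assumes NumPy and scikit-learn are installed; all data here is made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Two reviewers tagging the same 10 items (binary labels).
reviewer_a = ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"]
reviewer_b = ["yes", "no", "no", "no", "no", "no", "yes", "no", "yes", "no"]

# Percent agreement: simple, but inflated when one class dominates.
percent_agreement = np.mean([a == b for a, b in zip(reviewer_a, reviewer_b)])

# Cohen's kappa: corrects for agreement expected by chance.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"percent agreement: {percent_agreement:.2f}, kappa: {kappa:.2f}")

# For dense tasks (e.g., segmentation masks), compare overlap instead.
mask_a = np.zeros((64, 64), dtype=bool)
mask_b = np.zeros((64, 64), dtype=bool)
mask_a[10:40, 10:40] = True
mask_b[15:45, 15:45] = True

intersection = np.logical_and(mask_a, mask_b).sum()
union = np.logical_or(mask_a, mask_b).sum()
iou = intersection / union
dice = 2 * intersection / (mask_a.sum() + mask_b.sum())
print(f"IoU: {iou:.2f}, Dice: {dice:.2f}")

# One common rule: call the pair "in agreement" if overlap clears a
# project-specific threshold (0.7 here is arbitrary).
agrees = iou >= 0.7
```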
How to read the numbers: there isn’t a universal “good” threshold. Many teams treat values around 0.6–0.8 as workable and >0.8 as strong, but context matters. Safety-critical use cases demand tighter agreement; exploratory labeling may accept lower numbers early on. Track IAA per class or attribute (colors, sizes, diagnoses) because disagreement usually hides in a few tricky categories.
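One lightweight way to track agreement per attribute is to group the labels and compute kappa within each group. This sketch assumes pandas and scikit-learn are available; the attribute names and labels are invented for illustration.

```python
# Sketch of per-attribute IAA tracking; data below is illustrative only.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.DataFrame({
    "attribute":  ["color", "color", "color", "size", "size", "size"],
    "reviewer_a": ["red",   "blue",  "blue",  "S",    "M",    "L"],
    "reviewer_b": ["red",   "blue",  "green", "S",    "M",    "M"],
})

# Compute kappa separately for each attribute to see where
# disagreement concentrates.
for attribute, group in labels.groupby("attribute"):
    kappa = cohen_kappa_score(group["reviewer_a"], group["reviewer_b"])
    print(f"{attribute}: kappa = {kappa:.2f}")
```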
Example
Two reviewers label 200 product titles as “organic cotton” (yes/no). They both say “yes” on 30 items and “no” on 150; they disagree on the remaining 20 (assume the disagreements split evenly, 10 each way). Observed agreement is 180/200 = 0.90. Because positives are rare, each reviewer says “yes” on only 40 of 200 items, so expected agreement by chance is 0.2 × 0.2 + 0.8 × 0.8 = 0.68. Cohen’s kappa ≈ (0.90 − 0.68) / (1 − 0.68) ≈ 0.69, solid but not perfect. That gap points to where to improve: tighten the definition of “organic,” add edge-case examples (blends, recycled content), and run a short calibration round before relabeling.
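The same arithmetic written out as a short script, using the even 10/10 split of disagreements assumed above:

```python
# Reproduce the worked example: 200 items, 30 joint "yes", 150 joint "no",
# 20 disagreements assumed to split 10 each way.
n = 200
both_yes, both_no, disagree = 30, 150, 20

p_observed = (both_yes + both_no) / n                            # 0.90
p_yes_a = (both_yes + disagree / 2) / n                          # 40/200 = 0.20
p_yes_b = (both_yes + disagree / 2) / n                          # 0.20
p_expected = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)   # 0.68
kappa = (p_observed - p_expected) / (1 - p_expected)             # ≈ 0.69

print(f"observed: {p_observed:.2f}, expected: {p_expected:.2f}, kappa: {kappa:.2f}")
```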
In day-to-day operations, raise IAA by rewriting ambiguous guidelines, adding boundary examples, running small calibration sprints, and using an annotation workflow with maker–checker reviews and adjudication for tough cases. Sample head, torso, and long-tail items in every calibration so agreement reflects real traffic, not just the easy slice.