F1 Score is a single number that balances precision (how many of your positive predictions were correct) and recall (how many of the true positives you actually found). It's the harmonic mean of the two, so it only gets high when both are high: F1 = 2 × (precision × recall) / (precision + recall).
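A minimal sketch of the formula in Python (the `f1_score` helper name here is just for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.9))  # 0.90 -- high only when both components are high
print(f1_score(0.9, 0.1))  # 0.18 -- one weak component drags F1 down
```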
Teams prefer F1 over plain accuracy when classes are imbalanced (e.g., only a small fraction of items are truly “positive”), because accuracy can look great while the model quietly misses most positives.
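A quick sketch with scikit-learn (assuming it is installed) makes the failure mode concrete: a model that predicts "negative" for everything scores high accuracy but zero F1 on an imbalanced set.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that predicts "negative" for every item.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))              # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0  -- every positive missed
```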
In everyday use, you pick a decision threshold (for a classifier’s score) and report precision, recall, and F1 at that threshold. If the business cost of false positives is high (e.g., auto-approving refunds), you might tune for higher precision; if missing positives is worse (e.g., fraud or safety), you push recall up.
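One common way to explore that trade-off is to sweep thresholds and inspect F1 at each one. A sketch using scikit-learn's `precision_recall_curve`, with made-up labels and scores standing in for real model output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical ground truth and classifier scores; in practice these
# come from your model's predictions on a held-out set.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Guard against 0/0 when both precision and recall are zero.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])  # the final precision/recall pair has no threshold
print(f"best threshold={thresholds[best]:.2f}, F1={f1[best]:.2f}")
```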
F1 provides a neutral middle ground when the costs are comparable. For multi-class or multi-label problems, teams either average per-class F1 equally (macro F1), weight per-class F1 by support so frequent classes count more (weighted F1), or pool true positives, false positives, and false negatives across all classes before computing a single score (micro F1). In pixel tasks like medical image segmentation, the commonly reported Dice coefficient is mathematically equivalent to F1 measured over pixels.
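In scikit-learn, these averaging modes map to the `average` parameter of `f1_score`; a brief sketch with toy multi-class labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by support
print(f1_score(y_true, y_pred, average="micro"))     # pooled TP/FP/FN across classes
```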
F1 also depends on label quality. Inconsistent or incomplete labels depress both precision and recall, making F1 look worse (or misleadingly better) than the model actually deserves. A tight annotation workflow—clear guidelines, adjudication of disagreements, and periodic audits—keeps the metric trustworthy and comparable across releases.
Example
Suppose you’re tagging products that truly have “organic cotton” in their specs. There are 120 real positives in your test set. Your model predicts 90 positives; 72 are correct. That gives precision = 72/90 = 0.80 and recall = 72/120 = 0.60. Plugging in: F1 = 2 × (0.80 × 0.60) / (0.80 + 0.60) = 0.69. If you raise the threshold to cut false positives, precision might climb to 0.87 while recall drops to 0.52; F1 would fall to ~0.65, telling you the stricter setting found fewer real positives overall.
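The same arithmetic as a few lines of Python, plugging in the example's numbers:

```python
tp, predicted_pos, actual_pos = 72, 90, 120

precision = tp / predicted_pos  # 0.80
recall = tp / actual_pos        # 0.60
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.69

# Stricter threshold from the example: precision 0.87, recall 0.52.
p2, r2 = 0.87, 0.52
print(round(2 * p2 * r2 / (p2 + r2), 2))  # 0.65
```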