
Model Performance

In machine learning, model performance describes how well your model predicts on new, unseen data. For computer vision tasks, performance is measured by metrics such as F1 score, IoU (Intersection over Union), or other domain-specific criteria. This guide explains these concepts and shows how to improve and track your models using hyperparameter sweeps, experiment-tracking tools like Weights & Biases (W&B), and Protege’s automatic W&B integration.


1. Introduction to Key Performance Metrics

1.1 F1 Score

F1 score is the harmonic mean of precision and recall. It’s often used in classification tasks (including object detection, where each detected box can be seen as a classification event—true positives, false positives, etc.).

  • Precision: How many of your predicted positives are truly positive.
  • Recall: How many of the actual positives you captured.
F1 = 2 * \frac{precision * recall}{precision + recall}

Why F1?
If either precision or recall is low, F1 stays low, so it is useful when you want to balance capturing as many positives as possible (recall) against limiting false positives (precision).
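
As a quick sanity check, here is a minimal Python sketch of these formulas; the counts in the example are made up for illustration.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from raw true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, 40 false negatives
print(f1_score(tp=80, fp=20, fn=40))  # precision 0.80, recall ~0.67, F1 ~0.73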


1.2 Intersection over Union (IoU)

IoU is commonly used in object detection and segmentation. It measures the overlap between the predicted bounding box (or mask) and the ground-truth box (or mask).

IoU = \frac{Area\ of\ Overlap}{Area\ of\ Union}
  • Overlap = Intersection area of predicted and ground-truth region
  • Union = Combined area of both regions

An IoU threshold (e.g., 0.5) is often used to decide if a detection is a true positive. For segmentation, a per-pixel IoU can measure how well a model’s predicted mask lines up with the actual object shape.
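
For intuition, here is a minimal Python sketch of box IoU for axis-aligned [x1, y1, x2, y2] boxes; the coordinates in the example are hypothetical.

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted 10 px from the ground truth
print(box_iou([0, 0, 100, 100], [10, 10, 110, 110]))  # ~0.68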


1.3 Other Metrics

  • mAP (Mean Average Precision): Aggregates precision-recall across multiple classes or IoU thresholds.
  • Pixel Accuracy / Dice Coefficient: For segmentation tasks.
  • Recall @ k: For ranking or retrieval-based tasks.
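
The Dice coefficient is closely related to IoU; here is a minimal NumPy sketch for binary masks, using toy masks for illustration.

import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice = 2 * |intersection| / (|pred| + |target|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * intersection / denom if denom else 1.0

pred = np.zeros((4, 4), dtype=bool); pred[:, :2] = True      # predicted mask
target = np.zeros((4, 4), dtype=bool); target[:, 1:3] = True  # ground-truth mask
print(dice_coefficient(pred, target))  # 0.5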

2. Tracking Performance in Weights & Biases

Weights & Biases (W&B) is a popular experiment-tracking tool. When you train a model with frameworks like PyTorch, or through Protege, you can log performance metrics (loss curves, F1, IoU, mAP, etc.) to W&B in real time.
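
If you are logging manually (outside Protege's integration), a minimal sketch with the standard wandb API looks like this; the project name, run name, and dummy metric values are placeholders.

import random
import wandb

# Placeholder project/run names; substitute your own.
run = wandb.init(project="my-vision-project", name="experiment-42")

for epoch in range(10):
    # Dummy metrics stand in for your real training/validation loop.
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.05
    val_f1 = min(0.95, 0.5 + 0.05 * epoch)
    val_iou = min(0.90, 0.4 + 0.05 * epoch)
    # Each call logs one row of metrics; W&B plots them in real time.
    wandb.log({"epoch": epoch, "train/loss": train_loss,
               "val/f1": val_f1, "val/iou": val_iou})

run.finish()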

2.1 Protege’s Automatic W&B Integration

You can enable W&B logging directly from your Protege job spec:

[wandb]
project = "my-vision-project"
name = "experiment-42"
  • Protege automatically logs training loss, validation metrics, and predictions (if configured).
  • Runs are grouped by W&B project and can be compared visually.

Why W&B?

  • Centralized dashboard to manage all training runs.
  • Overlay and compare runs with different hyperparameters.
  • Beautiful charts and filters for parameter/metric tracking.
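
Beyond the dashboard, runs can also be pulled down programmatically through the W&B public API; in this sketch the entity/project path and the config/metric keys are assumptions, so adjust them to your own setup.

import wandb

api = wandb.Api()
# "my-team/my-vision-project" is a placeholder entity/project path.
runs = api.runs("my-team/my-vision-project")

# Print each run's key hyperparameter next to its final metric.
for run in runs:
    lr = run.config.get("learning_rate")
    f1 = run.summary.get("val/f1")
    print(f"{run.name}: lr={lr}, val/f1={f1}")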

3. Hyperparameter Sweeping

A hyperparameter sweep systematically tries different configurations (learning rate, batch size, optimizer type, etc.) to find optimal combinations. Protege supports this:

  1. Define sweep settings in your job spec
  2. Launch jobs in parallel
  3. Compare results in W&B
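
For reference, here is a minimal sketch of a plain W&B sweep (Protege's job-spec syntax for sweeps may differ); the parameter names, ranges, and project name are illustrative.

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/f1", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [8, 16, 32]},
        "optimizer": {"values": ["sgd", "adam"]},
    },
}

def train():
    run = wandb.init()
    lr = wandb.config.learning_rate   # sampled value for this trial
    # Run your real training loop here, then log the metric the sweep optimizes:
    wandb.log({"val/f1": 0.0})        # placeholder value
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-vision-project")
wandb.agent(sweep_id, function=train, count=10)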

Why Sweeps?

  • Avoid trial-and-error guesswork
  • Automatically discover combinations that improve performance



4. Improving Model Performance

4.1 Data Quality

  • Garbage in, garbage out: If your labeled data is inconsistent, no metric optimization will help.
  • Use a clear labeling guide and confirm that your bounding boxes or segmentation masks are accurate.

4.2 Model Architecture

  • Sometimes a deeper or more specialized architecture (e.g., maskrcnn_resnet101 vs. maskrcnn_resnet50) might yield better IoU.

4.3 Augmentations

  • For robust performance, especially with limited data, consider augmentations (random crops, flips, brightness changes, etc.).
  • Check your F1/IoU in W&B to confirm whether augmentation actually helps or hurts.
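
A minimal torchvision sketch of a classification-style augmentation pipeline is shown below; the specific transforms and ranges are illustrative, and for detection or segmentation you also need to transform boxes and masks consistently (e.g., with torchvision.transforms.v2).

import torchvision.transforms as T

# Illustrative training-time augmentations; tune the ranges for your dataset.
train_transforms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop and resize
    T.RandomHorizontalFlip(p=0.5),                # horizontal flip
    T.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast changes
    T.ToTensor(),
])

# Use deterministic resizing for validation so F1/IoU stays comparable across runs.
val_transforms = T.Compose([T.Resize((224, 224)), T.ToTensor()])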

4.4 Balanced Classes

  • If you have extremely imbalanced classes, consider weighted loss or oversampling.
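
A minimal PyTorch sketch of both options, using toy imbalanced data; the weights here are illustrative and are often derived from inverse class frequency.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data: 100 examples of class 0, 10 of class 1 (heavily imbalanced).
features = torch.randn(110, 8)
labels = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(features, labels)

# Option 1: weight the loss so the rare class contributes more.
class_weights = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Option 2: oversample the rare class at the data-loader level.
sample_weights = [10.0 if y == 1 else 1.0 for y in labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)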

5. Linking Performance to Production Filters

Once a high-performance model is found:

  1. Export the final artifact (weights).
  2. Bundle it into your filter Docker image or load it at runtime.
  3. Monitor real-world usage—sometimes the real data distribution differs from training!
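
A minimal PyTorch sketch of loading an exported checkpoint at runtime; the file path is a placeholder, and maskrcnn_resnet50_fpn stands in for whatever architecture you actually trained.

import torch
import torchvision

# "model_weights.pth" is a placeholder path for your exported artifact.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
state_dict = torch.load("model_weights.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # inference mode: disables dropout, uses running BN statistics

with torch.no_grad():
    dummy_image = torch.rand(3, 512, 512)  # stand-in for a real frame
    predictions = model([dummy_image])     # list of dicts: boxes, labels, scores, masks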

Key: If you observe “model drift” or performance degradation in production, you may need to collect new data and re-run sweeps.


Summary

  • F1, Precision, Recall, and IoU are critical metrics to gauge a model’s success in classification, detection, or segmentation.
  • Weights & Biases integration (especially with Protege) helps you track, visualize, and compare these metrics over multiple runs.
  • Hyperparameter Sweeps systematically explore parameter combinations to find the best-performing configuration.
  • Combine strong data curation, architecture choices, and augmentations to push performance to a production-ready level.
  • Once the model is validated, integrate it into your filter pipeline, and stay ready to gather new data and refine again.