Protege User Guide

Overview

This document walks you through how to:

  1. Install and Set Up Protege
  2. Create a Training Job Spec
  3. Configure Your Cloud Environment
  4. Launch a Training Job (single-run or sweep)
  5. Monitor & Debug
  6. Export & Use Models

1. Introduction

Protege is a suite of tools for training computer vision models using cloud resources (like GCP). You define everything in a TOML "job spec," specifying the dataset, model architecture, training parameters, and export settings. Protege then launches a pipeline to orchestrate data ingestion, model training, and final artifact export.

Key features:

  • Supports multiple task types (object detection, instance segmentation, classification, keypoint detection, semantic segmentation).
  • Integrates with Encord for dataset management.
  • Leverages GCP for GPU training (with flexible accelerators).
  • Exports models for easy consumption by model_runtime or third-party pipelines.

[Figure: Protege architecture]


2. Installation & Setup

  1. Clone the protege-ml repository
git clone https://github.com/PlainsightAI/protege-ml.git
cd protege-ml
  2. Install Dependencies
    Make sure you have Python 3.10+ installed. Then install the requirements:
pip install -e .
# or
pip install -r requirements.txt

(Adjust for your environment—conda, venv, etc.)

  3. Authentication
  • If you plan to run training on GCP, ensure you have application-default credentials set (a quick Python check follows this list):
gcloud auth application-default login
  • For Encord dataset usage, you will need an Encord API key. Make sure it’s properly referenced in your environment or the job spec (if required by your internal workflow).
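
To sanity-check the GCP side before launching anything, you can confirm that application-default credentials resolve from Python. A minimal sketch using the google-auth library (this is not part of Protege itself):

import google.auth

# Resolves application-default credentials, as set up by
# `gcloud auth application-default login`; raises DefaultCredentialsError
# if no credentials are found.
credentials, project = google.auth.default()
print(f"Authenticated against GCP project: {project}")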

3. Understanding the Job Specification

Protege uses TOML files to specify training configurations. Here’s an example breakdown of a typical spec file, referencing sample_job_spec.toml:

[dataset]
source = "encord"
config.project_hash = "b1269f2e-f7e5-46c7-9ba9-39f06cbbb9ce"

[sweep]
run = false

[model]
task = "instance_segmentation"
architecture = "maskrcnn_resnet50_fpn_v2"

[training]
duration = 60
batch_size = 4
num_workers = 4
trainable_backbone_layers = 1
optimizer.type = "sgd"
optimizer.learning_rate = 0.01
augmentations.training = ""
augmentations.inference = ""
scheduler = ""

[cloud_provider]
platform = "GCP"
config.project = "alcibiades-dev"
config.location = "us-central1"
config.disk_size = 500
config.accelerator = "NVIDIA_TESLA_V100"
config.machine_type = "n1-highmem-8"

[export]
path = "gs://dummy-runs/models/model.zip"

[wandb]
project = "pipeline-sweeps-2"

Sections at a Glance

[dataset]
  • source: Where data originates (currently encord)
  • config.project_hash: The Encord project hash
[sweep]
  • run: Enables sweep mode
  • num_sweep_workers: Number of jobs
  • epochs_per_sweep: Epochs per run
[model]
  • task: Task type (e.g., instance_segmentation)
  • architecture: Model architecture (e.g., maskrcnn_resnet50_fpn_v2)
[training]
  • duration, batch_size, num_workers
  • optimizer & scheduler
  • augmentations
[cloud_provider]
  • platform, config.project, config.accelerator, etc.
[export]
  • path: Where the final model is saved (GCS/local)
[wandb]
  • project: Weights & Biases project name
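
Before launching, it can also help to confirm that your spec parses and contains the sections Protege expects. A minimal sketch using Python's built-in TOML parser (tomllib ships with Python 3.11+; on 3.10, the third-party tomli package provides the same API). This only checks structure; Protege performs its own validation:

import tomllib  # Python 3.11+; on 3.10, `import tomli as tomllib`

with open("sample_job_spec.toml", "rb") as f:
    spec = tomllib.load(f)

# Sanity-check the core sections before spending cloud time on a bad spec.
for section in ("dataset", "model", "training", "cloud_provider", "export"):
    assert section in spec, f"missing [{section}] section"

print(spec["model"]["task"], spec["model"]["architecture"])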

4. Single vs. Sweep Runs

[Figure: Sweep vs. single run]

  • Single-Run: Set [sweep].run = false. Only one training job is executed.
  • Sweep: Set [sweep].run = true and configure sweep workers and epochs.

Example:

[sweep]
run = true
num_sweep_workers = 10
epochs_per_sweep = 5

This spawns 10 parallel jobs for 5 epochs each.
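
When sizing a sweep, it helps to keep total GPU time in view, since workers run in parallel but each is billed separately. A back-of-the-envelope sketch with hypothetical per-epoch timing (measure on a small run first):

# Hypothetical numbers; adjust to your dataset and hardware.
num_sweep_workers = 10
epochs_per_sweep = 5
minutes_per_epoch = 12  # assumption, not a Protege default

total_gpu_minutes = num_sweep_workers * epochs_per_sweep * minutes_per_epoch
wall_clock_minutes = epochs_per_sweep * minutes_per_epoch  # workers run in parallel

print(f"~{total_gpu_minutes} GPU-minutes billed, ~{wall_clock_minutes} minutes wall-clock")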


5. Launching a Training Job

With your spec ready:

python -m protege.pipelines path/to/sample_job_spec.toml [--enable_cache]

Options:

  • --enable_cache: Use local cache to speed up reruns

Logs will show:

  • GCP instance creation
  • Data pulling from Encord
  • Training progress

Make sure you're authenticated with GCP and Encord.


6. Monitoring & Debugging

  • Console Logs: Real-time job output
  • Weights & Biases: If enabled, tracks metrics (see the sketch after this list)
  • GCP Console: Logs for VMs or jobs
  • Failure Logs: Helpful for debugging (e.g., OOM errors)
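
If W&B logging is enabled, you can also pull run metrics programmatically through the W&B public API. A minimal sketch; "my-entity" is a placeholder for your W&B team or username, and the metric names depend on what the training loop logs:

import wandb

# List runs in the project from the sample spec and print a summary metric.
api = wandb.Api()
for run in api.runs("my-entity/pipeline-sweeps-2"):
    print(run.name, run.state, run.summary.get("loss"))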

7. Exporting & Using Models

Export

Controlled by the [export].path field:

[export]
path = "gs://my-model-bucket/experiments/run1/model.zip"

The trained model will be uploaded there automatically.
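
Once a run finishes, the archive can be fetched like any other GCS object. A minimal sketch using the google-cloud-storage client, mirroring the bucket and path from the example above:

from google.cloud import storage

# Download the exported model archive produced by the [export] step.
client = storage.Client()
bucket = client.bucket("my-model-bucket")
bucket.blob("experiments/run1/model.zip").download_to_filename("model.zip")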

Consume

from model_runtime import Runtime, Backend

# Point the runtime at the exported archive and select the TorchScript backend.
runtime = Runtime("gs://my-model-bucket/experiments/run1/model.zip", backend=Backend.TORCHSCRIPT)

# inference() takes a batch (here, a single preprocessed image tensor).
preds = runtime.inference([image_tensor])

Use model_runtime or your own logic to run inference.
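
If you are not using model_runtime, one option is to load the TorchScript module directly with PyTorch. This is a sketch under the assumption that the exported archive is a TorchScript file loadable by torch.jit.load; the actual layout of model.zip is determined by Protege's export step, so verify before relying on it:

import torch

# Assumes the export is a TorchScript archive; torch.jit.load reads the
# zip-format files produced by torch.jit.save.
model = torch.jit.load("model.zip", map_location="cpu")
model.eval()

with torch.no_grad():
    # Dummy input; shape and format depend on the exported architecture.
    preds = model(torch.rand(1, 3, 640, 640))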


8. Best Practices & Tips

  • Try T4 GPUs first (cheaper than V100/A100)
  • Lower the batch size if you hit out-of-memory (OOM) errors
  • Clean datasets are critical (well-labeled, diverse)
  • Use W&B for metrics and visualizations
  • Start small with sweeps, then scale up
  • Version control your .toml specs

9. FAQ

Can I do local CPU training?
Yes, but it's much slower. You'll need to configure the spec so the job doesn't provision GCP resources.

Where do I set epochs?
For sweeps, in [sweep].epochs_per_sweep; for single runs, via the [training] parameters (e.g., duration).

Can I run multiple models from one spec?
Not directly—create multiple spec files for different models.

Is there a debug mode?
Use --enable_cache and a small dataset for quick tests.


10. Next Steps

  • Try a minimal example with small data
  • Scale up to GCP jobs with real datasets
  • Integrate the exported model into a filter or runtime pipeline

Conclusion

Protege gives you a powerful, TOML-driven pipeline for training computer vision models in the cloud. Define, train, monitor, and export—all from one spec. With proper dataset hygiene, GCP setup, and W&B integration, you’ll be able to produce reproducible models that plug straight into your inference pipeline or filter stack.

Need advanced help? Explore the model_runtime repo, ask internally, or dive into the full documentation for deeper customization!