Protege User Guide
Overview
This document walks you through how to:
- Install and Set Up Protege
- Create a Training Job Spec
- Configure Your Cloud Environment
- Launch a Training Job (single-run or sweep)
- Monitor & Debug
- Export & Use Models
1. Introduction
Protege is a suite of tools for training computer vision models using cloud resources (like GCP). You define everything in a TOML "job spec," specifying the dataset, model architecture, training parameters, and export settings. Protege then launches a pipeline to orchestrate data ingestion, model training, and final artifact export.
Key features:
- Supports multiple task types (object detection, instance segmentation, classification, keypoint detection, semantic segmentation).
- Integrates with Encord for dataset management.
- Leverages GCP for GPU training (with flexible accelerators).
- Exports models for easy consumption by model_runtime or third-party pipelines.
2. Installation & Setup
- Clone the protege-ml repository
git clone https://github.com/PlainsightAI/protege-ml.git
cd protege-ml
- Install Dependencies
Make sure you have Python 3.10+ installed. Then install the requirements:
pip install -e .
# or
pip install -r requirements.txt
(Adjust for your environment—conda, venv, etc.)
- Authentication
- If you plan to run training on GCP, ensure you have application-default credentials set (a quick programmatic check follows this list):
gcloud auth application-default login
- For Encord dataset usage, you will need an Encord API key. Make sure it’s properly referenced in your environment or the job spec (if required by your internal workflow).
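To confirm your application-default credentials are actually visible to Python tooling, here is a quick check using the google-auth package (typically installed alongside the GCP client libraries); this is a convenience sketch, not part of Protege itself:
import google.auth

# Raises DefaultCredentialsError if application-default credentials are missing
credentials, project = google.auth.default()
print("ADC found; default project:", project)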
3. Understanding the Job Specification
Protege uses TOML files to specify training configurations. Here's an example breakdown of a typical spec file, referencing sample_job_spec.toml:
[dataset]
source = "encord"
config.project_hash = "b1269f2e-f7e5-46c7-9ba9-39f06cbbb9ce"
[sweep]
run = false
[model]
task = "instance_segmentation"
architecture = "maskrcnn_resnet50_fpn_v2"
[training]
duration = 60
batch_size = 4
num_workers = 4
trainable_backbone_layers = 1
optimizer.type = "sgd"
optimizer.learning_rate = 0.01
augmentations.training = ""
augmentations.inference = ""
scheduler = ""
[cloud_provider]
platform = "GCP"
config.project = "alcibiades-dev"
config.location = "us-central1"
config.disk_size = 500
config.accelerator = "NVIDIA_TESLA_V100"
config.machine_type = "n1-highmem-8"
[export]
path = "gs://dummy-runs/models/model.zip"
[wandb]
project = "pipeline-sweeps-2"
Sections at a Glance
[dataset]
- source: Where data originates (currently encord)
- config.project_hash: The Encord project hash
[sweep]
- run: Enables sweep mode
- num_sweep_workers: Number of jobs
- epochs_per_sweep: Epochs per run
[model]
- task: Task type (e.g., instance_segmentation)
- architecture: Model architecture (e.g., maskrcnn_resnet50_fpn)
[training]
- duration, batch_size, num_workers
- optimizer & scheduler
- augmentations
[cloud_provider]
- platform, config.project, config.accelerator, etc.
[export]
- path: Where the final model is saved (GCS or local)
[wandb]
- project: Weights & Biases project name
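Before paying for a cloud run, you can sanity-check a spec locally. A minimal sketch (not part of Protege itself) using Python's tomllib, which is in the standard library on 3.11+; on 3.10, pip install tomli:
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # pip install tomli on Python 3.10

with open("sample_job_spec.toml", "rb") as f:
    spec = tomllib.load(f)

# Section names mirror the sample spec above; adjust if your workflow differs
for section in ("dataset", "model", "training", "cloud_provider", "export"):
    assert section in spec, f"missing [{section}] section"
print("Spec looks sane:", spec["model"]["task"], "->", spec["export"]["path"])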
4. Single vs. Sweep Runs
- Single-Run: Set [sweep].run = false. Only one training job is executed.
- Sweep: Set [sweep].run = true and configure sweep workers and epochs.
Example:
[sweep]
run = true
num_sweep_workers = 10
epochs_per_sweep = 5
This spawns 10 parallel jobs for 5 epochs each.
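Because the workers run in parallel, wall-clock time scales with epochs_per_sweep rather than the job count. A rough sizing sketch (minutes_per_epoch is a placeholder you should measure on your own data):
num_sweep_workers = 10       # parallel jobs, from the example above
epochs_per_sweep = 5         # epochs per job
minutes_per_epoch = 12       # placeholder; measure on your own workload

total_epochs = num_sweep_workers * epochs_per_sweep        # 50 epochs of total compute
wall_clock_minutes = epochs_per_sweep * minutes_per_epoch  # jobs run in parallel
print(f"{total_epochs} total epochs, ~{wall_clock_minutes} min wall clock")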
5. Launching a Training Job
With your spec ready:
python -m protege.pipelines path/to/sample_job_spec.toml [--enable_cache]
Options:
- --enable_cache: Use a local cache to speed up reruns
Logs will show:
- GCP instance creation
- Data pulling from Encord
- Training progress
Make sure you're authenticated with GCP and Encord.
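If you maintain several specs, a small driver script can launch them one after another through the same CLI entry point shown above (the specs/ directory is an assumed layout; adapt the path to your repo):
import pathlib
import subprocess

# "specs/" is an assumption; point this at wherever your .toml files live
for spec in sorted(pathlib.Path("specs").glob("*.toml")):
    print("Launching", spec)
    subprocess.run(
        ["python", "-m", "protege.pipelines", str(spec)],
        check=True,  # stop on the first failed launch
    )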
6. Monitoring & Debugging
- Console Logs: Real-time job output
- Weights & Biases: If enabled, tracks metrics (see the snippet after this list)
- GCP Console: Logs for VMs or jobs
- Failure Logs: Helpful for debugging (e.g., OOM errors)
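If W&B tracking is enabled, run metrics can also be pulled programmatically. A minimal sketch with the wandb API client ("my-team" and the "loss" key are placeholders; use your own entity and metric names):
import wandb

api = wandb.Api()
# Run paths are "<entity>/<project>"; "my-team" is a placeholder entity
for run in api.runs("my-team/pipeline-sweeps-2"):
    # summary holds the final logged metrics; key names depend on the trainer
    print(run.name, run.state, run.summary.get("loss"))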
7. Exporting & Using Models
Export
Controlled by the [export].path field:
[export]
path = "gs://my-model-bucket/experiments/run1/model.zip"
The trained model will be uploaded there automatically.
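If you need the artifact outside of model_runtime, the standard google-cloud-storage client can fetch it; a sketch using the bucket and path from the example above:
from google.cloud import storage

client = storage.Client()  # picks up application-default credentials
blob = client.bucket("my-model-bucket").blob("experiments/run1/model.zip")
blob.download_to_filename("model.zip")
print("Downloaded model.zip")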
Consume
from model_runtime import Runtime, Backend

# Load the exported artifact straight from GCS with the TorchScript backend
runtime = Runtime("gs://my-model-bucket/experiments/run1/model.zip", backend=Backend.TORCHSCRIPT)
# image_tensor is your preprocessed input image
preds = runtime.inference([image_tensor])
Use model_runtime or your own logic to run inference.
8. Best Practices & Tips
- Try T4 GPUs first (cheaper than V100/A100)
- Lower batch size for memory errors
- Clean datasets are critical (well-labeled, diverse)
- Use W&B for metrics and visualizations
- Start small with sweeps, then scale up
- Version control your .toml specs
9. FAQ
Can I do local CPU training?
Yes, but it's slower. Disable GCP logic accordingly.
Where do I set epochs?
In sweep.epochs_per_sweep or with a custom training param.
Can I run multiple models from one spec?
Not directly—create multiple spec files for different models.
Is there a debug mode?
Use --enable_cache and a small dataset for quick tests.
10. Next Steps
- Try a minimal example with small data
- Scale up to GCP jobs with real datasets
- Integrate the exported model into a filter or runtime pipeline
Conclusion
Protege gives you a powerful, TOML-driven pipeline for training computer vision models in the cloud. Define, train, monitor, and export—all from one spec. With proper dataset hygiene, GCP setup, and W&B integration, you’ll be able to produce reproducible models that plug straight into your inference pipeline or filter stack.
Need advanced help? Explore the model_runtime repo, ask internally, or dive into the full documentation for deeper customization!