Jobs Tracking

Overview

oip-tracking-client can be used to track your jobs. The same rules apply as in the tracking module, but some environment variables are exposed to make integration with jobs easier.

How to use oip-tracking-client in your training script?

OICM seamlessly integrates oip-tracking-client with the jobs module. All the variables required for tracking are set by default:
- api_host: set to the host of the current environment
- api_key: not required, because authentication is handled internally
- workspace_name: set to the workspace of the running job

You can override these defaults to track the job in a different environment or workspace.
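As a rough illustration of overriding only some of the defaults, the sketch below collects the values a caller sets explicitly and passes just those on. It assumes that `TrackingClient.connect()` accepts `api_host`, `api_key`, and `workspace_name` as keyword arguments under those names, which this page does not confirm. `build_connect_kwargs` and `connect_with_overrides` are hypothetical helpers, not part of the client.

```python
def build_connect_kwargs(api_host=None, api_key=None, workspace_name=None):
    """Return only the settings the caller explicitly overrides (hypothetical helper)."""
    candidates = {
        "api_host": api_host,
        "api_key": api_key,
        "workspace_name": workspace_name,
    }
    # Settings left as None are omitted so the job's defaults stay in effect.
    return {name: value for name, value in candidates.items() if value is not None}


def connect_with_overrides(**overrides):
    """Connect using the job defaults plus any explicit overrides (hypothetical helper)."""
    # Imported lazily so this module loads even where the client is not installed.
    from oip_tracking_client.tracking import TrackingClient

    # Assumption: connect() accepts these names as keyword arguments.
    TrackingClient.connect(**build_connect_kwargs(**overrides))
```

Under these assumptions, calling `connect_with_overrides(workspace_name="other-workspace")` would change only the workspace and leave the host and authentication at their defaults.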

Example

import os
from oip_tracking_client.tracking import TrackingClient

TrackingClient.connect()

experiment_name = "Jobs Test 4"
TrackingClient.set_experiment(experiment_name) # creates a new one if it does not exist

# rest of the code remains the same

Tracking on pods with multiple GPUs

When tracking metrics on pods with multiple GPUs, note that torchrun spawns one process per GPU.

Each process will then create its own experiment run.

To avoid this, use the TrackingClient only in the process whose GLOBAL_RANK environment variable is 0.
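One possible way to apply this guard in a training script is sketched below. `is_main_process` and `start_tracking` are hypothetical helper names. Treating an unset GLOBAL_RANK as rank 0 (a single-process run) is an assumption made here, not something this page specifies.

```python
import os


def is_main_process(env=None):
    """Return True when this process should own the experiment run."""
    env = os.environ if env is None else env
    # Assumption: if GLOBAL_RANK is unset, the script runs as a single process.
    return env.get("GLOBAL_RANK", "0").strip() == "0"


def start_tracking(experiment_name, env=None):
    """Connect and select the experiment on rank 0 only; return whether tracking is on."""
    if not is_main_process(env):
        return False
    # Imported lazily so the other ranks never touch the tracking client.
    from oip_tracking_client.tracking import TrackingClient

    TrackingClient.connect()
    TrackingClient.set_experiment(experiment_name)
    return True
```

Keeping the returned flag lets the rest of the script apply the same condition to later tracking calls, so processes with a non-zero rank skip them as well.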