Jobs Tracking
Overview
oip-tracking-client can be used to track your jobs. The same rules apply as in the tracking module, but some environment variables are exposed to make integration with jobs easier.
How to use oip-tracking-client in your training script?
OICM seamlessly integrates oip-tracking-client with the jobs module. All the variables required for tracking are set by default:
- api_host: set to the host of the current environment
- api_key: not required, because authentication is handled internally
- workspace_name: set to the workspace of the running job
You can override these defaults to track the job in a different environment or workspace.
Example
import os
from oip_tracking_client.tracking import TrackingClient
TrackingClient.connect() # no arguments needed inside a job: api_host, api_key and workspace_name are preset
experiment_name = "Jobs Test 4"
TrackingClient.set_experiment(experiment_name) # creates a new one if it does not exist
# rest of the code remains the same
Tracking on pods with multiple GPUs
When tracking metrics on a pod with multiple GPUs, note that torchrun spawns one process per GPU.
Each process then creates its own experiment run.
To avoid these duplicate runs, use the TrackingClient only in the process whose GLOBAL_RANK environment variable is 0.
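A minimal sketch of such a guard follows. The helper names is_tracking_process and setup_tracking are hypothetical and not part of oip-tracking-client, and treating a missing GLOBAL_RANK as rank 0 (a single-process run) is an assumption. Only TrackingClient.connect() and TrackingClient.set_experiment() are taken from the example above.

```python
import os


def is_tracking_process(env=None):
    """Return True only for the process whose GLOBAL_RANK is 0 (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: if GLOBAL_RANK is not set, this is a single-process run, so track.
    return int(env.get("GLOBAL_RANK", "0")) == 0


def setup_tracking(experiment_name):
    """Connect and select the experiment on the rank-0 process only (hypothetical helper)."""
    if not is_tracking_process():
        # Other ranks skip tracking, so no extra experiment runs are created.
        return False
    # Imported lazily so processes that skip tracking never load the client.
    from oip_tracking_client.tracking import TrackingClient

    TrackingClient.connect()
    TrackingClient.set_experiment(experiment_name)
    return True
```

In a training script, setup_tracking("Jobs Test 4") could be called once at startup, with any later tracking calls wrapped in the same is_tracking_process() check so that only one process reports.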