Fundamentals

Overview

The Jobs Management module offers a comprehensive solution for deploying and managing AI jobs across Kubernetes and SLURM environments. This module is designed to streamline the execution of AI tasks, ensuring efficient resource utilization and seamless operations across diverse computational setups.

Key Features

  • Diverse Job Initiation Options:

    • Script-Based Submission: Upload scripts through the UI to define AI tasks.
    • Docker Image Submission: Deploy jobs using Docker images containing all necessary code and dependencies.
  • Comprehensive Job Configuration: Configure job dependencies, runtime parameters, and execution constraints to meet specific performance standards.

  • Advanced Resource Specification: Specify necessary computational resources such as CPUs, GPUs, and memory to optimize each job's performance.

  • Robust Job Submission and Scheduling:

    • Gang Scheduling: Allocates all necessary resources before starting the job to prevent partial execution.
    • Priority-Based Scheduling: Customizes job priorities, adapting scheduling policies to various operational needs.
  • Real-Time Monitoring and Management: Provides tools for real-time tracking of CPU, GPU, and memory usage, and detailed logs for job execution tracking and performance analysis.

  • Proactive Alerting and Notifications: Sophisticated system for issuing alerts about critical conditions or anomalies, enabling timely interventions.

  • Streamlined Completion and Post-Processing: Manages all post-processing tasks automatically, including data aggregation and cleanup.

  • Flexible Result Storage and Access: Supports configurable storage solutions for securely storing and easily accessing job outputs.

  • Cross-Platform Compatibility: Manages AI jobs effectively within both Kubernetes and SLURM environments, ensuring a unified job management approach.
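To make the gang scheduling and priority-based scheduling described above concrete, here is a minimal, self-contained Python sketch of the idea. All names in it (`JobSpec`, `GangScheduler`) are hypothetical illustrations, not part of the module's actual API: a job is started only when every requested resource can be allocated at once, and queued jobs are considered in priority order.

```python
# Hypothetical sketch -- illustrative names only, not the module's real API.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class JobSpec:
    priority: int  # lower value = scheduled first
    name: str = field(compare=False, default="")
    cpus: int = field(compare=False, default=1)
    gpus: int = field(compare=False, default=0)
    memory_gb: int = field(compare=False, default=4)


class GangScheduler:
    """Toy scheduler: a job starts only if *all* requested resources fit."""

    def __init__(self, cpus, gpus, memory_gb):
        self.free = {"cpus": cpus, "gpus": gpus, "memory_gb": memory_gb}
        self.queue = []

    def submit(self, job):
        heapq.heappush(self.queue, job)

    def schedule(self):
        """Start queued jobs in priority order; defer any that don't fully fit."""
        started, deferred = [], []
        while self.queue:
            job = heapq.heappop(self.queue)
            need = {"cpus": job.cpus, "gpus": job.gpus, "memory_gb": job.memory_gb}
            if all(self.free[k] >= v for k, v in need.items()):
                # Gang scheduling: allocate everything atomically, then start.
                for k, v in need.items():
                    self.free[k] -= v
                started.append(job.name)
            else:
                # No partial allocation -- the whole job waits.
                deferred.append(job)
        for job in deferred:
            heapq.heappush(self.queue, job)
        return started
```

For example, on a pool of 16 CPUs and 4 GPUs, a high-priority job requesting all 4 GPUs starts first, and a lower-priority job needing 2 GPUs is held back in the queue rather than started with a partial allocation:

```python
sched = GangScheduler(cpus=16, gpus=4, memory_gb=64)
sched.submit(JobSpec(priority=0, name="train", cpus=8, gpus=4, memory_gb=32))
sched.submit(JobSpec(priority=1, name="eval", cpus=4, gpus=2, memory_gb=16))
sched.schedule()  # → ["train"]; "eval" stays queued until GPUs free up
```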

The Jobs module also supports distributed training with Ray; see the dedicated page to learn more.