Navarch

Open-source GPU fleet management

Navarch automates provisioning, health monitoring, and lifecycle management of GPU nodes across cloud providers.

Health Monitoring

Detect GPU failures in real time. Catches XID errors, thermal issues, ECC faults, and NVLink failures via NVML before they crash your workloads.
Auto-Replacement

Unhealthy nodes get terminated and replaced automatically. Define health policies with CEL expressions. Your pool stays at capacity.
Multi-Cloud

Provision across Lambda Labs, GCP, and AWS from a single config. Failover between providers or optimize for cost.
Autoscaling

Scale based on GPU utilization, queue depth, schedules, or predictions. Cooldown prevents thrashing. Combine multiple strategies.
Pool Management

Group nodes by instance type, region, or workload. Set scaling limits, health policies, and labels per pool.
Simulator

Test policies and failure scenarios locally. Stress test with 1000+ simulated nodes before deploying to production.

Why Navarch#

GPUs fail. Cloud providers give you instances, but detecting hardware failures and replacing bad nodes is your problem. Teams end up building custom monitoring with DCGM, dmesg parsing, and cloud-specific scripts. Then there's the multi-cloud problem: different APIs, different instance types, different tooling.

Navarch makes your GPU supply self-healing and fungible across clouds, all under one system to manage it all:

Unified health monitoring for XID errors, thermal events, ECC faults, and NVLink
Automatic replacement when nodes fail health checks
Source GPUs anywhere. Lambda out of H100s? Failover to GCP or AWS automatically.
Single control plane for Lambda, GCP, and AWS. One config, one API.
Works with your scheduler. Kubernetes, SLURM, or bare metal.

How it works#

Navarch architecture

The control plane manages pools, evaluates health policies, and provisions or terminates instances through cloud provider APIs.

The node agent runs on each GPU instance. It reports health via NVML, sends heartbeats, and executes commands from the control plane.

Navarch complements your existing scheduler. It handles infrastructure; your scheduler places workloads.

Quick look#

# navarch.yaml
providers:
  lambda:
    type: lambda
    api_key_env: LAMBDA_API_KEY

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    region: us-west-1
    min_nodes: 2
    max_nodes: 8
    health:
      auto_replace: true
    autoscaling:
      type: reactive
      scale_up_at: 80
      scale_down_at: 20

control-plane -config navarch.yaml

Next steps#

Getting Started

Set up Navarch with Lambda Labs.

Getting started
Core Concepts

Pools, providers, health checks, node lifecycle.

Concepts
Configuration

Full reference for navarch.yaml.

Configuration
Architecture

How Navarch integrates with your stack.

Architecture