Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

SkyPilot is an open-source framework from UC Berkeley's Sky Computing Lab that provides a unified interface for running AI workloads across Kubernetes, Slurm, and 20+ cloud providers. With 9,700 stars on GitHub and an Apache 2.0 license, it has become the go-to solution for teams that need to train and serve models across heterogeneous infrastructure without rewriting their deployment scripts for each provider.

The core problem SkyPilot solves is infrastructure fragmentation. As GPU availability fluctuates across clouds and on-premise clusters, AI teams waste significant engineering time adapting workloads to different providers, managing spot instance preemptions, and manually optimizing for cost. SkyPilot abstracts all of this behind a single YAML specification, automatically selecting the cheapest available infrastructure and handling failover transparently.

## Architecture and Design

SkyPilot operates as a control plane that sits between your AI workloads and the underlying compute providers. It translates a portable workload specification into provider-specific API calls, handling provisioning, monitoring, and lifecycle management.

| Component | Purpose | Key Details |
|-----------|---------|-------------|
| Task Specification | Workload definition | YAML or Python API for resources, environment, and commands |
| Optimizer | Cost/availability optimization | Selects the cheapest provider meeting resource requirements |
| Provisioner | Infrastructure management | Auto-provisions VMs, Kubernetes pods, or Slurm jobs |
| Job Controller | Execution management | Queuing, monitoring, auto-recovery, and spot preemption handling |
| Sky Serve | Model serving | Auto-scaling inference endpoints across regions and clouds |

The **optimizer** is SkyPilot's central intelligence.
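To make the optimizer's selection step concrete, here is a toy sketch of cost-based placement. This is not SkyPilot's actual implementation; the providers, prices, and availability below are invented for illustration.

```python
# Illustrative sketch of cost-based placement -- NOT SkyPilot's real
# optimizer. Providers, prices, and availability are invented.
from dataclasses import dataclass


@dataclass
class Offering:
    provider: str
    accelerator: str
    count: int
    hourly_usd: float
    available: bool


def pick_placement(offerings, accelerator, count):
    """Return the cheapest available offering meeting the requirement."""
    feasible = [
        o for o in offerings
        if o.available and o.accelerator == accelerator and o.count >= count
    ]
    return min(feasible, key=lambda o: o.hourly_usd, default=None)


catalog = [
    Offering("aws", "A100", 8, 32.77, True),
    Offering("gcp", "A100", 8, 29.39, False),   # no capacity right now
    Offering("lambda", "A100", 8, 17.92, True),
]

best = pick_placement(catalog, "A100", 4)
print(best.provider)  # -> lambda (cheapest offering with capacity)
```

The real optimizer additionally consults live catalogs and re-runs this decision on failure, which is what makes cross-cloud failover possible.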
Given a task's resource requirements (GPU type, count, memory, disk), it queries real-time pricing and availability across all configured providers and selects the optimal placement. If the chosen instance is preempted or fails, the system automatically migrates the job to the next best option, potentially on a completely different cloud provider, without user intervention.

**Sky Serve** extends this to model serving, providing auto-scaling inference endpoints that can span multiple clouds and regions. This enables teams to serve models with geographic redundancy and cost optimization that would be extremely complex to build manually.

## Key Features

**True Multi-Cloud Portability**: A single YAML file defines your workload, and SkyPilot runs it on whichever provider offers the best price and availability. Supported providers include AWS, GCP, Azure, OCI, CoreWeave, Lambda Cloud, RunPod, Nebius, Fluidstack, and many more, plus any Kubernetes cluster.

**3-6x Cost Savings with Managed Spot**: SkyPilot makes spot/preemptible instances practical for training by automatically checkpointing, detecting preemptions, and relaunching on the cheapest available spot capacity. Teams report 3-6x cost savings compared to on-demand instances.

**Kubernetes-Native Developer Experience**: When running on Kubernetes, SkyPilot provides SSH access, live code syncing, and IDE integration on top of standard pods. Developers get a familiar VM-like workflow without leaving the Kubernetes ecosystem.

**Automatic Failover and Recovery**: If a node fails, an instance gets preempted, or a cloud region experiences an outage, SkyPilot automatically recovers the job on alternative infrastructure. For long-running training jobs, this eliminates the manual babysitting that typically consumes engineering time.

**Job Queue and Resource Management**: Built-in job queuing allows teams to submit multiple training runs and let SkyPilot schedule them across available capacity.
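Managed spot recovery only helps jobs that actually save progress. A minimal sketch of the resume-from-checkpoint pattern a training script needs, where the checkpoint format (a JSON file holding the last completed step) is an arbitrary choice for this example and real jobs would persist model and optimizer state:

```python
# Minimal resume-from-checkpoint pattern for preemptible training.
# The JSON checkpoint format is an arbitrary choice for this sketch.
import json
import os


def load_step(ckpt_path):
    """Return the last completed step, or 0 if no checkpoint exists."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)["step"]
    return 0


def train(ckpt_path, total_steps):
    """Run (or resume) training, persisting progress after every step."""
    start = load_step(ckpt_path)  # 0 on a fresh run, >0 after preemption
    for step in range(start, total_steps):
        # ... one training step would run here ...
        with open(ckpt_path, "w") as f:
            json.dump({"step": step + 1}, f)
    return load_step(ckpt_path)


if __name__ == "__main__":
    path = "/tmp/demo_ckpt.json"
    if os.path.exists(path):
        os.remove(path)          # start fresh for the demo
    train(path, 5)               # completes steps 0-4
    print(load_step(path))       # -> 5
```

If this job is preempted mid-run and relaunched on new capacity, `train` picks up at the saved step instead of starting over, which is the contract SkyPilot's spot recovery relies on.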
Auto-stop policies ensure idle resources are terminated, preventing cost overruns.

## Code Example

```yaml
# sky_task.yaml - Train a model on the cheapest GPU
resources:
  accelerators: A100:4
  use_spot: true

setup: |
  pip install torch transformers
  git clone https://github.com/my-org/my-model.git

run: |
  cd my-model
  torchrun --nproc_per_node=4 train.py \
    --batch_size 32 \
    --checkpoint_dir /checkpoint
```

```bash
# Launch on cheapest available cloud
sky launch sky_task.yaml

# Check status
sky status

# Serve a model with auto-scaling
sky serve up serve.yaml
```

## Limitations

SkyPilot requires initial configuration of cloud credentials for each provider, which can be time-consuming for organizations with strict IAM policies. The optimizer's pricing data depends on provider APIs that occasionally lag behind actual pricing changes. While Kubernetes support is strong, some advanced Kubernetes features (custom schedulers, specialized operators) may require workarounds. The job recovery mechanism relies on checkpointing, so workloads that don't implement checkpointing won't benefit from automatic spot recovery. Multi-node distributed training requires careful networking configuration, particularly across cloud providers. Finally, the abstraction layer inevitably means some provider-specific features are not exposed through SkyPilot's interface.

## Who Should Use This

SkyPilot is essential for AI teams spending significant time and money managing GPU infrastructure across multiple providers. Organizations training large models that want to leverage spot instances without the operational burden of manual preemption handling will see immediate ROI. Platform engineering teams building internal ML platforms can use SkyPilot as the compute layer, providing researchers with self-service GPU access while maintaining cost control.
Startups that need GPU flexibility (using Lambda Cloud when it is cheap, falling back to GCP when it is not) will find SkyPilot's multi-cloud optimization particularly valuable. Research labs running many experiments in parallel will benefit from the job queue and automatic resource management.
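One gap in the Code Example section above: `sky serve up serve.yaml` references a service spec that is never shown. A minimal sketch of what such a file might contain, where the `service`, `readiness_probe`, `replicas`, and `ports` fields follow SkyPilot's documented service YAML but the server command is a hypothetical placeholder:

```yaml
# serve.yaml - Auto-scaling inference endpoint (sketch; verify field
# names against the SkyPilot Serve docs for your version)
service:
  readiness_probe: /health   # path polled to decide replica readiness
  replicas: 2                # fixed replica count; autoscaling policies also exist

resources:
  accelerators: A100:1
  ports: 8080

run: |
  python server.py --port 8080   # hypothetical model server command
```

Sky Serve provisions each replica with the same optimizer-driven placement as batch jobs, so replicas may land on different clouds or regions behind one endpoint.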