Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Explore the latest AI open-source projects from GitHub and HuggingFace.
GPUStack is an open-source GPU cluster manager built specifically for AI model deployment. Rather than asking operators to wire together Kubernetes, an inference engine, a model registry, and a load balancer by hand, GPUStack ships an opinionated all-in-one control plane that turns a pile of heterogeneous GPUs into a coherent inference platform. As of May 2026 the project has crossed 5,000 GitHub stars and is being adopted by enterprises that want a self-hosted alternative to managed model-as-a-service offerings. ## All-in-One Inference Control Plane GPUStack bundles model discovery, deployment, autoscaling, routing, observability, and an OpenAI-compatible API into a single binary plus a lightweight worker agent. Operators install the server on a control node, run the agent on each GPU host, and almost immediately get a web console that lists available accelerators, lets them deploy models from Hugging Face or local files, and exposes an /v1/chat/completions endpoint that downstream apps can use unchanged. ## Heterogeneous Hardware Support A defining feature is broad hardware coverage. GPUStack supports NVIDIA CUDA, AMD ROCm, Huawei Ascend with MindIE, Apple Silicon, and other accelerators, and can mix them in a single cluster. This is significant for organizations operating under chip-export restrictions or simply trying to make use of older hardware alongside newer chips. Workloads can be pinned to specific hardware types or routed dynamically based on cost and availability. ## Multi-Engine Backend GPUStack is not itself an inference engine. Instead, it orchestrates established engines, including vLLM, SGLang, llama.cpp, and others, choosing the most appropriate backend per model and per hardware target. This separation keeps GPUStack focused on the cluster and operations problem while still benefiting from rapid progress in upstream engines. When a new vLLM release lands, operators upgrade the engine without touching the rest of the stack. ## Distributed Inference and Multi-Node Models For models too large to fit on a single GPU or node, GPUStack supports distributed inference with tensor and pipeline parallelism across multiple hosts. The scheduler understands accelerator topology and tries to place model shards on GPUs connected by the fastest available interconnect, reducing the communication tax that dominates large-model serving. ## OpenAI-Compatible API and Model-as-a-Service Every deployed model is exposed through an OpenAI-compatible REST API, with API key management, per-key rate limits, and usage metering. This effectively gives an organization its own internal model-as-a-service platform, where teams can self-serve model deployments and consume them with any OpenAI SDK without code changes. Built-in dashboards track requests, tokens, latency, and GPU utilization across the cluster. ## Operational Niceties GPUStack pays attention to the unglamorous parts of running an AI platform: rolling upgrades, graceful drain for maintenance, integrated logging, model checksum verification, and offline installation modes for air-gapped environments. The Apache 2.0 license and self-hosted design make it a defensible choice for regulated industries that cannot send traffic to public model APIs. ## Limitations Because GPUStack is opinionated, teams with already-mature Kubernetes-based ML platforms may find it overlapping with components they have already built. Some advanced features available in dedicated inference platforms, such as fine-grained KV-cache-aware routing, are still simpler in GPUStack than in specialist tools like AIBrix or Mooncake. As an actively evolving project, API stability and feature parity across hardware vendors continue to improve, so adopters should track release notes carefully when upgrading.