whichllm - Open Source | Evermx

Trending

whichllm

Andyyyy64 | MIT

Inference | 4.2K Stars | 231 Forks | 52 views

whichllm is a command-line tool that recommends the best open-source local LLM for your specific hardware, ranked by real, recency-aware benchmarks rather than parameter count or VRAM fit alone. With 4,150 GitHub stars under an MIT license, it cuts through the noise of the local-LLM landscape, where dozens of new models ship every week and benchmarks are scattered across LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, and the Open LLM Leaderboard, and produces a single ranked list that says: on your machine, run this model, in this quantization, with this expected speed.

## The Problem It Solves

Most local-LLM tooling answers the wrong question. Ollama tells you which models exist. llama.cpp tells you whether a quantization will load. HuggingFace tells you what's popular. None of them tell you which of the 5,000+ open-weight checkpoints actually runs well on the GPU sitting in your desktop, or which of the dozens of quantization variants of that checkpoint is the right trade-off between quality, VRAM, and tokens per second. whichllm is the first tool in the space to treat this as a search-and-ranking problem with real evidence, not a heuristic over file size.

## How the Ranking Engine Works

The scoring engine merges benchmark data from six leaderboards and resolves evidence through five tiers of decreasing confidence:

1. Direct model-ID matches get full weight.
2. Variants are accepted with suffix-stripping.
3. Base-model inheritance applies interpolation.
4. Family-level extrapolation is size-aware.
5. Self-reported claims are heavily discounted.

Scores are tagged with confidence markers (a tilde for interpolated values, `!sr` for self-reported, a question mark for missing data) so users can see at a glance which numbers to trust. Critically, the system actively rejects fabricated claims and refuses to let small forks inherit the benchmarks of much larger base models, which is the failure mode that makes leaderboards look misleading.
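The tiered evidence resolution described above can be illustrated with a minimal sketch. This is not whichllm's actual code: the tier names, the numeric weights, the 0-100 score scale, and the choice to show a tilde for both the base-model and family tiers are all assumptions made for illustration. The source only states the ordering of the tiers and the three markers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tier weights, highest confidence first. The source gives the
# ordering of the five tiers but not the numeric discounts; these are invented.
TIER_WEIGHTS = {
    "direct": 1.00,         # exact model-ID match, full weight
    "variant": 0.95,        # matched after suffix-stripping
    "base_model": 0.85,     # inherited from the base model
    "family": 0.70,         # extrapolated from the model family
    "self_reported": 0.40,  # heavily discounted
}

# Markers named in the text: "~" interpolated, "!sr" self-reported, "?" missing.
# Which tiers count as "interpolated" is an assumption here.
TIER_MARKERS = {
    "direct": "",
    "variant": "",
    "base_model": "~",
    "family": "~",
    "self_reported": "!sr",
}


@dataclass
class Evidence:
    tier: str         # one of the keys of TIER_WEIGHTS
    raw_score: float  # benchmark score, assumed to be on a 0-100 scale


def resolve(evidence: list[Evidence]) -> tuple[Optional[float], str]:
    """Use the highest-confidence evidence available; return (score, marker)."""
    if not evidence:
        return None, "?"  # no data at all
    order = list(TIER_WEIGHTS)  # dicts preserve insertion order
    best = min(evidence, key=lambda e: order.index(e.tier))
    return round(best.raw_score * TIER_WEIGHTS[best.tier], 2), TIER_MARKERS[best.tier]


def format_score(score: Optional[float], marker: str) -> str:
    """Render a score with its confidence marker, e.g. '51.0~' or '?'."""
    return marker if score is None else f"{score}{marker}"
```

In this sketch a self-reported number is only used when nothing better exists, and its marker travels with the score so the reader can discount it further.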
On top of benchmark quality, the ranker accounts for model size on a log2 scale, applies quantization penalties calibrated per format (Q4_K_M, Q5_K_M, AWQ, GPTQ, FP16), and adjusts for runtime fit type: whether the model fits fully in VRAM, partially with CPU offload, or only in system RAM. Speed estimates derive from GPU memory bandwidth, the active-parameter ratio for MoE architectures, and backend-specific throughput factors. Recency-aware demotion prevents stale leaderboards from advantaging older model generations that have since been surpassed.

## Hardware Detection and Backends

The hardware detection layer uses nvidia-ml-py for NVIDIA cards, native Metal queries for Apple Silicon, and ROCm for AMD. VRAM estimation accounts for weights, KV cache, activation memory, and a ~500MB framework overhead, so the recommendations don't fall over when the user actually tries to run the model. Supported formats include GGUF via llama-cpp-python (the universal path), AWQ/GPTQ via transformers with their respective inference libraries, and FP16/BF16. Apple Silicon and CPU-only systems are restricted to GGUF for stability; Linux with NVIDIA GPUs unlocks the broader format set. Ollama integration is handled by piping JSON output to a script that maps HuggingFace IDs to local model names, a clean composition rather than a duplicated model catalog.

## Practical CLI Surface

The command set is small and useful:

- `whichllm` alone gives auto-detected recommendations for the current machine.
- `whichllm --gpu "RTX 4090"` simulates an upgrade.
- `whichllm plan "llama 3 70b"` does a reverse lookup: what hardware do I need to run this model?
- `whichllm upgrade "RTX 4090" "RTX 5090"` produces a comparative analysis useful for purchase decisions.
- `whichllm run` launches an interactive chat session, automatically selecting the optimal GGUF variant.
- `whichllm snippet` emits Python integration code.
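The Ollama integration mentioned earlier relies on a small glue script that maps HuggingFace IDs to local model names. A hedged sketch of such a script follows. The JSON shape (an array of objects with a `model_id` field) is an assumption, since the source does not document whichllm's output schema, and the mapping table and function name are illustrative rather than part of either tool.

```python
import json

# Illustrative mapping from HuggingFace repo IDs to Ollama model names.
# A real script would maintain its own table; these entries are examples only.
HF_TO_OLLAMA = {
    "meta-llama/Meta-Llama-3-8B-Instruct": "llama3:8b",
    "Qwen/Qwen2.5-7B-Instruct": "qwen2.5:7b",
}


def ollama_names_from_json(payload: str) -> list[str]:
    """Map ranked recommendations to Ollama names, preserving rank order.

    `payload` is assumed to be a JSON array of objects that each carry a
    hypothetical "model_id" field. Entries without a known mapping are skipped.
    """
    names = []
    for rec in json.loads(payload):
        name = HF_TO_OLLAMA.get(rec.get("model_id", ""))
        if name is not None:
            names.append(name)
    return names
```

Used as a pipe target, the script would pass `sys.stdin.read()` to `ollama_names_from_json` and print the first result, which can then be handed to `ollama pull`.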
All commands accept filters for evidence level, quantization, minimum speed threshold, and task profile (general, coding, vision, math).

## Position in the Local-LLM Ecosystem

whichllm sits one layer above the inference engines and one layer below the model registries, and that's a real gap in the open-source stack. It's the right tool to run before installing Ollama or LM Studio, and it's the right tool to run again whenever you upgrade a GPU. Built with Python 3.11+, Typer for the CLI, and Rich for output formatting, it's been in active development since March 2026 and has hit 4,150 stars within months, which is fast adoption for a CLI tool with no marketing surface. The MIT license and the explicit emphasis on transparent hardware reporting from users suggest the project is positioned to become the default recommendation engine for the local-LLM community rather than a vendor-aligned tool.

Key Features

Recommends local LLMs ranked by real benchmarks, not parameter count
Merges results from LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, and the Open LLM Leaderboard
Five-tier evidence resolution with confidence-tagged scores (~, !sr, ?)
Hardware auto-detection for NVIDIA, AMD, Apple Silicon, and CPU-only
VRAM estimation accounting for weights, KV cache, activations, and framework overhead
GPU simulation and reverse hardware lookup (`whichllm plan`, `whichllm upgrade`)
Interactive chat via `whichllm run` with automatic GGUF variant selection
MoE-aware speed estimates using GPU memory bandwidth and active-parameter ratio
Supports GGUF, AWQ, GPTQ, FP16, BF16 with backend-appropriate routing
JSON output for Ollama integration and scripting
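The VRAM accounting and MoE-aware speed estimate listed above can be approximated with standard back-of-the-envelope formulas. The sketch below is an assumption about the general approach, not whichllm's implementation: only the ~500MB framework overhead comes from the source, while the flat activation allowance, the 0.6 efficiency factor, and all names are placeholders.

```python
from dataclasses import dataclass

FRAMEWORK_OVERHEAD_GB = 0.5  # the ~500MB overhead cited in the description


@dataclass
class ModelSpec:
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters read per token (equals total for dense models)
    bits_per_weight: float  # effective bits per weight for the chosen quantization
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_gb(m: ModelSpec) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return m.total_params_b * m.bits_per_weight / 8


def kv_cache_gb(m: ModelSpec, context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: keys and values for every layer, KV head, and position."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_len * bytes_per_elem / 1e9


def total_vram_gb(m: ModelSpec, context_len: int, activation_gb: float = 0.3) -> float:
    """Weights + KV cache + activations + framework overhead.

    The flat activation allowance is a placeholder, not a measured value.
    """
    return weights_gb(m) + kv_cache_gb(m, context_len) + activation_gb + FRAMEWORK_OVERHEAD_GB


def tokens_per_second(m: ModelSpec, bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode estimate: each token reads the active weights once.

    For MoE models only the active parameters are read, so speed scales with
    the active-parameter ratio. The efficiency factor stands in for
    backend-specific throughput and is an invented default.
    """
    gb_read_per_token = m.active_params_b * m.bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token * efficiency
```

For example, a dense 8B model at 4 bits per weight needs 4 GB for weights, and with 32 layers, 8 KV heads, and a head dimension of 128, an 8,192-token FP16 KV cache adds about 1.07 GB. The speed estimate ignores KV cache reads, so it is optimistic at long context lengths.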

Related Projects

Ollama (ollama) | Trending | Inference | MIT | 165.0K Stars | 15.0K Forks | 313 views

Ollama

llama.cpp

vLLM

Unsloth