llmfit: Find the Right LLM for Your Hardware
Ever downloaded a 70-billion-parameter model only to find it won’t fit in your GPU’s VRAM? Or wasted hours trying to figure out which quantization level will actually run on your machine? llmfit (https://github.com/AlexsJones/llmfit) solves this frustration: it’s a terminal tool that right-sizes LLMs to your system’s exact specifications.
It detects your hardware (CPU, RAM, GPU, VRAM), scores hundreds of models across quality, speed, fit, and context dimensions, and tells you exactly which ones will run well on your machine—with recommendations for the best quantization level.
With 12.2k stars on GitHub, this is quickly becoming an essential tool for anyone running local LLMs.
Key Features
🖥️ Interactive TUI
Launch the default terminal UI and get instant answers:
llmfit

The TUI shows your system specs at the top (CPU, RAM, GPU name, VRAM, backend) and displays models in a scrollable table sorted by composite score. Each row shows the model’s score, estimated tok/s, best quantization for your hardware, run mode, memory usage, and use-case category.
TUI Controls:
- ↑/↓ or j/k - Navigate models
- / - Search by name, provider, params, or use case
- Esc or Enter - Exit search mode
- Ctrl-U - Clear search
- f - Cycle fit filter (All, Runnable, Perfect, Good, Marginal)
- a - Cycle availability filter (All, GGUF Avail, Installed)
- s - Cycle sort column (Score, Params, Mem%, Ctx, Date, Use Case)
- d - Download selected model directly
- r - Refresh installed models from runtime providers
- i - Toggle installed-first sorting
- 1-9 - Toggle provider visibility
- Enter - Toggle detail view
- PgUp/PgDn - Scroll by 10
- g/G - Jump to top / bottom
- t - Cycle through 6 color themes
- p - Open Plan mode for hardware planning
- q - Quit
🧠 Smart Model Scoring
Each model is scored across four dimensions (0-100):
- Quality - Parameter count, model family reputation, quantization penalty, task alignment
- Speed - Estimated tokens/sec based on backend, params, and quantization
- Fit - Memory utilization efficiency (sweet spot: 50-80% of available memory)
- Context - Context window capability vs. target for the use case
Weights vary by use-case category. The composite score combines the four dimensions as a weighted sum (see the sketch after the table):
| Use Case | Quality | Speed | Fit | Context |
|---|---|---|---|---|
| General | 0.30 | 0.30 | 0.25 | 0.15 |
| Coding | 0.25 | 0.30 | 0.30 | 0.15 |
| Reasoning | 0.55 | 0.20 | 0.15 | 0.10 |
| Chat | 0.25 | 0.35 | 0.25 | 0.15 |
| Multimodal | 0.30 | 0.25 | 0.25 | 0.20 |
| Embedding | 0.20 | 0.30 | 0.30 | 0.20 |
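As a rough illustration, here is a minimal Rust sketch of how such a weighted composite could be computed. Only the weights come from the table above; the struct and function names are hypothetical, not llmfit’s actual internals:

```rust
/// Per-dimension scores on a 0-100 scale.
struct Scores {
    quality: f64,
    speed: f64,
    fit: f64,
    context: f64,
}

/// Weighted composite for the "Coding" use case (weights from the table above).
/// Hypothetical sketch; llmfit's real scoring lives in the repo.
fn composite_coding(s: &Scores) -> f64 {
    0.25 * s.quality + 0.30 * s.speed + 0.30 * s.fit + 0.15 * s.context
}

fn main() {
    let s = Scores { quality: 80.0, speed: 70.0, fit: 90.0, context: 60.0 };
    // 0.25*80 + 0.30*70 + 0.30*90 + 0.15*60 = 20 + 21 + 27 + 9 = 77
    println!("composite = {}", composite_coding(&s));
}
```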
🌐 Multi-Platform Hardware Detection
Automatically detects your system’s capabilities:
- NVIDIA - Multi-GPU support via nvidia-smi. Aggregates VRAM across all GPUs.
- AMD - Detected via rocm-smi
- Intel Arc - Discrete VRAM via sysfs, integrated via lspci
- Apple Silicon - Unified memory via system_profiler. VRAM = system RAM.
- Ascend - Detected via npu-smi
- CPU - Fallback for CPU-only execution
🔄 Dynamic Quantization Selection
llmfit doesn’t assume a fixed quantization. It walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed) and picks the highest-quality level that fits in available memory (sketched below).
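A minimal sketch of that selection loop, assuming a simple bytes-per-parameter approximation for each GGUF quantization level. The constants and function are illustrative, not llmfit’s actual tables:

```rust
/// Approximate bytes per parameter for common GGUF quantization levels,
/// ordered from best quality to most compressed. Illustrative values only.
const QUANT_LEVELS: &[(&str, f64)] = &[
    ("Q8_0", 1.06),
    ("Q6_K", 0.82),
    ("Q5_K_M", 0.69),
    ("Q4_K_M", 0.60),
    ("Q3_K_M", 0.49),
    ("Q2_K", 0.33),
];

/// Pick the highest-quality quantization whose weights fit the memory budget.
fn pick_quant(params_billions: f64, budget_gb: f64) -> Option<&'static str> {
    QUANT_LEVELS.iter().find_map(|&(name, bytes_per_param)| {
        // 1B parameters ≈ bytes_per_param GB of weights at this level.
        let size_gb = params_billions * bytes_per_param;
        (size_gb <= budget_gb).then_some(name)
    })
}

fn main() {
    println!("{:?}", pick_quant(70.0, 24.0)); // Some("Q2_K"): 70B barely fits in 24 GB
    println!("{:?}", pick_quant(7.0, 24.0));  // Some("Q8_0"): 7B fits at full quality
}
```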
🦴 MoE Support
Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3) are detected automatically. Only a subset of experts is active per token, so effective VRAM is much lower than total parameter count suggests. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token.
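The arithmetic behind that example, as a tiny illustrative snippet (the parameter counts are from the Mixtral model card; the calculation is not llmfit’s actual fit logic):

```rust
fn main() {
    // Mixtral 8x7B: 46.7B total parameters, ~12.9B active per token.
    let (total_b, active_b) = (46.7_f64, 12.9_f64);
    // Only this fraction of the weights is exercised per token, which is why
    // the effective per-token working set is much smaller than the total
    // parameter count suggests.
    println!("active fraction: {:.1}%", 100.0 * active_b / total_b); // ≈ 27.6%
}
```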
🔌 Runtime Provider Integration
Supports multiple local LLM runtimes:
- Ollama - Detect installed models, download new ones directly from TUI
- llama.cpp - Direct GGUF downloads from Hugging Face + local cache detection
- MLX - Apple Silicon / mlx-community model cache
When multiple providers are available, pressing d opens a provider picker modal.
🌍 Massive Model Database
Covers 206 models from HuggingFace including:
- Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi
- DeepSeek, IBM Granite, Allen Institute OLMo, xAI Grok, Cohere, BigCode
- 01.ai, Upstage, TII Falcon, HuggingFace, Zhipu GLM, Moonshot Kimi, Baidu ERNIE
- Code-specific: CodeLlama, StarCoder2, WizardCoder, Qwen2.5-Coder, Qwen3-Coder
- Reasoning: DeepSeek-R1, Orca-2
- Multimodal: Llama 3.2 Vision, Llama 4 Scout/Maverick, Qwen2.5-VL
- Enterprise: IBM Granite
- Embedding: nomic-embed, bge
🌐 REST API
Run llmfit as a server for cluster schedulers:
llmfit serve --host 0.0.0.0 --port 8787
# Get top runnable models for this node
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
# Node hardware info
curl http://localhost:8787/api/v1/system
# Full fit list with filters
curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"🤖 OpenClaw Integration
llmfit ships as an OpenClaw skill that lets AI agents recommend hardware-appropriate local models and auto-configure Ollama, llama.cpp, or LM Studio providers. Install the skill and ask your OpenClaw agent things like “What model should I use for coding on my MacBook?” The agent will call llmfit recommend --json under the hood, interpret the results, and offer to configure your setup.
# Install the skill
./scripts/install-openclaw-skill.sh
cp -r skills/llmfit-advisor ~/.openclaw/skills/

Platforms
- 🐧 Linux - Full support (NVIDIA, AMD, Intel Arc, Ascend)
- 🍎 macOS - Full support (Apple Silicon, Intel)
- 🪟 Windows - Full support
⚡ Speed Estimation
Token generation is memory-bandwidth-bound: each token requires reading the full model weights from VRAM. llmfit estimates throughput using:
(bandwidth_GB/s ÷ model_size_GB) × efficiency_factor (0.55)

The efficiency factor accounts for kernel overhead, KV-cache reads, and memory controller effects. llmfit includes a bandwidth lookup table covering ~80 GPUs (NVIDIA consumer + datacenter, AMD RDNA + CDNA, Apple Silicon).
For unrecognized GPUs, it uses backend-specific speed constants:
| Backend | Speed Constant |
|---|---|
| CUDA | 220 |
| Metal | 160 |
| ROCm | 180 |
| SYCL | 100 |
| CPU (ARM) | 90 |
| CPU (x86) | 70 |
| NPU (Ascend) | 390 |
Penalties apply for CPU offload (0.5×), CPU-only (0.3×), and MoE expert switching (0.8×). A sketch combining the formula, constants, and penalties follows.
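Putting the formula and penalties together, here is a minimal Rust sketch of the estimate. The function and the assumption that the penalties compose multiplicatively (with CPU-only and CPU-offload treated as mutually exclusive) are illustrative, not llmfit’s actual code:

```rust
/// Rough tokens/sec estimate for a memory-bandwidth-bound decoder.
/// Formula from above: (bandwidth / model_size) * 0.55, then penalties.
/// Illustrative sketch only, not llmfit's implementation.
fn estimate_tok_s(
    bandwidth_gb_s: f64, // from the GPU bandwidth lookup table
    model_size_gb: f64,  // quantized weight size
    cpu_offload: bool,   // part of the model spills to system RAM
    cpu_only: bool,      // no GPU at all
    moe: bool,           // Mixture-of-Experts expert switching
) -> f64 {
    let mut tok_s = bandwidth_gb_s / model_size_gb * 0.55;
    if cpu_only {
        tok_s *= 0.3;
    } else if cpu_offload {
        tok_s *= 0.5;
    }
    if moe {
        tok_s *= 0.8;
    }
    tok_s
}

fn main() {
    // e.g. ~1000 GB/s of bandwidth and a 4 GB quantized model:
    // 1000 / 4 * 0.55 = 137.5 tok/s before penalties.
    println!("{:.1} tok/s", estimate_tok_s(1000.0, 4.0, false, false, false));
}
```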
Get Started
Installation
macOS / Linux (Homebrew):
brew install llmfit

Windows:
scoop install llmfit

Quick Install:
curl -fsSL https://llmfit.axjns.dev/install.sh | sh

Build from Source:
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# Binary at target/release/llmfit

CLI Mode
Prefer classic command-line output? Use --cli:
# Table of all models ranked by fit
llmfit --cli
# Only perfectly fitting models, top 5
llmfit fit --perfect -n 5
# Show detected system specs
llmfit system
# Search by name, provider, or size
llmfit search "llama 8b"
# Detailed view of a single model
llmfit info "Mistral-7B"
# Top 5 recommendations as JSON
llmfit recommend --json --limit 5
# Plan required hardware for a specific model
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192

GPU Memory Override
If autodetection fails (VMs, broken nvidia-smi), manually specify VRAM:
llmfit --memory=32G
llmfit --memory=32000M

🔗 GitHub: github.com/AlexsJones/llmfit
Why This Tool Rocks
- Instant Answers - No more guessing which model will run on your hardware
- Multi-GPU Support - Detects and aggregates VRAM across multiple GPUs
- MoE Intelligence - Correctly handles Mixture-of-Experts models like Mixtral
- Smart Quantization - Automatically picks the best quantization that fits your memory
- Direct Downloads - Download models directly from the TUI via Ollama or llama.cpp
- Speed Estimation - Realistic tok/s predictions based on actual hardware bandwidth
- REST API - Integrate into cluster schedulers and orchestration tools
- 6 Beautiful Themes - Dracula, Solarized, Nord, Monokai, Gruvbox, and more
- Free & Open Source - MIT licensed, no subscriptions
Crepi il lupo! 🐺