llmfit: Find Which LLMs Run on Your Hardware



Ever downloaded a 70-billion-parameter model only to find it won’t fit in your GPU’s VRAM? Or wasted hours figuring out which quantization level will actually run on your machine? llmfit (https://github.com/AlexsJones/llmfit) solves this frustration: it’s a terminal tool that right-sizes LLMs to your system’s exact specifications.

It detects your hardware (CPU, RAM, GPU, VRAM), scores hundreds of models across quality, speed, fit, and context dimensions, and tells you exactly which ones will run well on your machine—with recommendations for the best quantization level.

With 12.2k stars on GitHub, this is quickly becoming an essential tool for anyone running local LLMs.

Key Features

🖥️ Interactive TUI

Launch the default terminal UI and get instant answers:

llmfit

The TUI shows your system specs at the top (CPU, RAM, GPU name, VRAM, backend) and displays models in a scrollable table sorted by composite score. Each row shows the model’s score, estimated tok/s, best quantization for your hardware, run mode, memory usage, and use-case category.

TUI Controls:

  • ↑ / ↓ or j / k - Navigate models
  • / - Search by name, provider, params, or use case
  • Esc or Enter - Exit search mode
  • Ctrl-U - Clear search
  • f - Cycle fit filter (All, Runnable, Perfect, Good, Marginal)
  • a - Cycle availability filter (All, GGUF Avail, Installed)
  • s - Cycle sort column (Score, Params, Mem%, Ctx, Date, Use Case)
  • d - Download selected model directly
  • r - Refresh installed models from runtime providers
  • i - Toggle installed-first sorting
  • 1-9 - Toggle provider visibility
  • Enter - Toggle detail view
  • PgUp / PgDn - Scroll by 10
  • g / G - Jump to top / bottom
  • t - Cycle through 6 color themes
  • p - Open Plan mode for hardware planning
  • q - Quit

🧠 Smart Model Scoring

Each model is scored across four dimensions (0-100):

  • Quality - Parameter count, model family reputation, quantization penalty, task alignment
  • Speed - Estimated tokens/sec based on backend, params, and quantization
  • Fit - Memory utilization efficiency (sweet spot: 50-80% of available memory)
  • Context - Context window capability vs. target for the use case

Weights vary by use-case category. The composite score combines:

Use Case     Quality  Speed  Fit   Context
General      0.30     0.30   0.25  0.15
Coding       0.25     0.30   0.30  0.15
Reasoning    0.55     0.20   0.15  0.10
Chat         0.25     0.35   0.25  0.15
Multimodal   0.30     0.25   0.25  0.20
Embedding    0.20     0.30   0.30  0.20
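As a sketch, the composite score is just a weighted sum of the four dimension scores. The weights below are copied from the table; the function and dictionary names are illustrative, not llmfit internals:

```python
# Minimal sketch of the composite score: a weighted sum of the four
# 0-100 dimension scores. Weights copied from the table above; the
# names here are illustrative, not llmfit's actual code.
WEIGHTS = {
    "general":   {"quality": 0.30, "speed": 0.30, "fit": 0.25, "context": 0.15},
    "coding":    {"quality": 0.25, "speed": 0.30, "fit": 0.30, "context": 0.15},
    "reasoning": {"quality": 0.55, "speed": 0.20, "fit": 0.15, "context": 0.10},
}

def composite(scores: dict, use_case: str) -> float:
    weights = WEIGHTS[use_case]
    return sum(scores[dim] * w for dim, w in weights.items())

# A high-quality but slow model ranks higher for reasoning than coding:
scores = {"quality": 90, "speed": 40, "fit": 70, "context": 80}
print(round(composite(scores, "reasoning"), 2))  # → 76.0
print(round(composite(scores, "coding"), 2))     # → 67.5
```

Reasoning’s heavy quality weight (0.55) is why a large, slow model can outrank a faster small one in that category.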

🌐 Multi-Platform Hardware Detection

Automatically detects your system’s capabilities:

  • NVIDIA - Multi-GPU support via nvidia-smi. Aggregates VRAM across all GPUs.
  • AMD - Detected via rocm-smi
  • Intel Arc - Discrete VRAM via sysfs, integrated via lspci
  • Apple Silicon - Unified memory via system_profiler. VRAM = system RAM.
  • Ascend - Detected via npu-smi
  • CPU - Fallback for CPU-only execution
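For NVIDIA, the multi-GPU aggregation step can be sketched by summing nvidia-smi’s per-GPU totals. llmfit itself is written in Rust; this Python snippet only illustrates the approach:

```python
# Illustrative sketch of multi-GPU VRAM aggregation: sum the per-GPU
# memory totals reported by nvidia-smi (one MiB integer per line).
import subprocess

def parse_vram_mb(output: str) -> int:
    """Sum per-GPU memory totals from nvidia-smi's noheader CSV output."""
    return sum(int(line) for line in output.splitlines() if line.strip())

def total_nvidia_vram_mb() -> int:
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mb(output)
```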

🔄 Dynamic Quantization Selection

llmfit doesn’t assume a fixed quantization. It tries the best quality quantization that fits your hardware by walking a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory.
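A minimal sketch of that walk, assuming rough bytes-per-parameter figures for each quantization level (the exact size tables llmfit uses may differ):

```python
# Sketch of the quantization walk: try Q8_0 first, fall back toward
# Q2_K until the estimated model size fits in available memory.
# The bytes-per-parameter figures are approximations, not llmfit's tables.
QUANTS = [  # best quality first
    ("Q8_0", 1.06), ("Q6_K", 0.86), ("Q5_K_M", 0.72),
    ("Q4_K_M", 0.60), ("Q3_K_M", 0.48), ("Q2_K", 0.35),
]

def pick_quant(params_billions: float, available_gb: float,
               headroom: float = 0.9):
    """Return the highest-quality quantization that fits, or None."""
    for name, bytes_per_param in QUANTS:
        size_gb = params_billions * bytes_per_param
        if size_gb <= available_gb * headroom:
            return name, round(size_gb, 1)
    return None  # nothing fits, even Q2_K

print(pick_quant(7, 12))   # a 7B model on a 12 GB GPU → ('Q8_0', 7.4)
print(pick_quant(70, 24))  # a 70B model on a 24 GB GPU → None
```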

🦴 MoE Support

Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3) are detected automatically. Only a subset of experts is active per token, so effective VRAM is much lower than total parameter count suggests. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token.
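The Mixtral example above can be made concrete with a one-liner, using the figures from the text:

```python
# Only a fraction of an MoE model's parameters are active per token.
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    return active_params_b / total_params_b

# Mixtral 8x7B: 46.7B total parameters, ~12.9B active per token
print(f"{active_fraction(46.7, 12.9):.0%} of parameters active per token")  # → 28%
```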

🔌 Runtime Provider Integration

Supports multiple local LLM runtimes:

  • Ollama - Detect installed models, download new ones directly from TUI
  • llama.cpp - Direct GGUF downloads from Hugging Face + local cache detection
  • MLX - Apple Silicon / mlx-community model cache

When multiple providers are available, pressing d opens a provider picker modal.

🌍 Massive Model Database

Covers 206 models from HuggingFace including:

  • Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi
  • DeepSeek, IBM Granite, Allen Institute OLMo, xAI Grok, Cohere, BigCode
  • 01.ai, Upstage, TII Falcon, HuggingFace, Zhipu GLM, Moonshot Kimi, Baidu ERNIE
  • Code-specific: CodeLlama, StarCoder2, WizardCoder, Qwen2.5-Coder, Qwen3-Coder
  • Reasoning: DeepSeek-R1, Orca-2
  • Multimodal: Llama 3.2 Vision, Llama 4 Scout/Maverick, Qwen2.5-VL
  • Enterprise: IBM Granite
  • Embedding: nomic-embed, bge

🌐 REST API

Run llmfit as a server for cluster schedulers:

llmfit serve --host 0.0.0.0 --port 8787

# Get top runnable models for this node
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

# Node hardware info
curl http://localhost:8787/api/v1/system

# Full fit list with filters
curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"
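From a scheduler, the same endpoint can be queried with Python’s standard library. The query parameters mirror the curl examples above; the JSON response shape is not documented here, so it is returned as-is:

```python
# Sketch of a client for the /api/v1/models/top endpoint shown above.
# Only the URL parameters are taken from the article; treat the
# response structure as opaque JSON.
import json
import urllib.request

def top_models_url(host="localhost", port=8787, limit=5,
                   min_fit="good", use_case="coding") -> str:
    return (f"http://{host}:{port}/api/v1/models/top"
            f"?limit={limit}&min_fit={min_fit}&use_case={use_case}")

def top_models(**kwargs):
    with urllib.request.urlopen(top_models_url(**kwargs)) as resp:
        return json.load(resp)
```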

🤖 OpenClaw Integration

llmfit ships as an OpenClaw skill that lets AI agents recommend hardware-appropriate local models and auto-configure Ollama, llama.cpp, or LM Studio providers. Install the skill and ask your OpenClaw agent things like “What model should I use for coding on my MacBook?” The agent will call llmfit recommend --json under the hood, interpret the results, and offer to configure your setup.

# Install the skill
./scripts/install-openclaw-skill.sh

# Or copy it into place manually
cp -r skills/llmfit-advisor ~/.openclaw/skills/

Platforms

  • 🐧 Linux - Full support (NVIDIA, AMD, Intel Arc, Ascend)
  • 🍎 macOS - Full support (Apple Silicon, Intel)
  • 🪟 Windows - Full support

⚡ Speed Estimation

Token generation is memory-bandwidth-bound: each token requires reading the full model weights from VRAM. llmfit estimates throughput using:

(bandwidth_GB/s ÷ model_size_GB) × efficiency_factor (0.55)

The efficiency factor accounts for kernel overhead, KV-cache reads, and memory controller effects. llmfit includes a bandwidth lookup table covering ~80 GPUs (NVIDIA consumer + datacenter, AMD RDNA + CDNA, Apple Silicon).

For unrecognized GPUs, it uses backend-specific speed constants:

Backend        Speed Constant
CUDA           220
Metal          160
ROCm           180
SYCL           100
CPU (ARM)      90
CPU (x86)      70
NPU (Ascend)   390

Penalties apply for CPU offload (0.5×), CPU-only (0.3×), and MoE expert switching (0.8×).
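Combining the bandwidth formula, the 0.55 efficiency factor, and the penalties gives a rough throughput model like this sketch. The example bandwidth (~1000 GB/s, roughly an RTX 4090) is an illustrative assumption, and the function is not llmfit’s actual code:

```python
# Rough throughput model: (bandwidth / model size) × efficiency,
# scaled by the run-mode and MoE penalties described above.
EFFICIENCY = 0.55
MODE_PENALTY = {"gpu": 1.0, "cpu_offload": 0.5, "cpu_only": 0.3}

def estimate_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                   mode: str = "gpu", moe: bool = False) -> float:
    tok_s = (bandwidth_gb_s / model_size_gb) * EFFICIENCY
    tok_s *= MODE_PENALTY[mode]
    if moe:
        tok_s *= 0.8  # expert-switching penalty
    return tok_s

# An ~8 GB quantized model on a ~1000 GB/s GPU:
print(round(estimate_tok_s(1000, 8.0), 1))  # ≈ 68.8 tok/s
```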

Get Started

Installation

macOS / Linux (Homebrew):

brew install llmfit

Windows:

scoop install llmfit

Quick Install:

curl -fsSL https://llmfit.axjns.dev/install.sh | sh

Build from Source:

git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# Binary at target/release/llmfit

CLI Mode

Prefer classic command-line output? Use --cli:

# Table of all models ranked by fit
llmfit --cli

# Only perfectly fitting models, top 5
llmfit fit --perfect -n 5

# Show detected system specs
llmfit system

# Search by name, provider, or size
llmfit search "llama 8b"

# Detailed view of a single model
llmfit info "Mistral-7B"

# Top 5 recommendations as JSON
llmfit recommend --json --limit 5

# Plan required hardware for a specific model
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192

GPU Memory Override

If auto-detection fails (e.g. in VMs, or when nvidia-smi is broken), manually specify the available memory:

llmfit --memory=32G
llmfit --memory=32000M
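If you script around llmfit, the flag’s value syntax is easy to normalize. This hypothetical helper (not part of llmfit, and assuming binary 1024-based units) converts it to bytes:

```python
# Hypothetical parser for the "32G" / "32000M" value syntax.
def parse_memory(value: str) -> int:
    units = {"G": 1024**3, "M": 1024**2}
    return int(value[:-1]) * units[value[-1].upper()]

print(parse_memory("32G") == parse_memory("32768M"))  # → True
```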

🔗 GitHub: github.com/AlexsJones/llmfit

Why This Tool Rocks

  • Instant Answers - No more guessing which model will run on your hardware
  • Multi-GPU Support - Detects and aggregates VRAM across multiple GPUs
  • MoE Intelligence - Correctly handles Mixture-of-Experts models like Mixtral
  • Smart Quantization - Automatically picks the best quantization that fits your memory
  • Direct Downloads - Download models directly from the TUI via Ollama or llama.cpp
  • Speed Estimation - Realistic tok/s predictions based on actual hardware bandwidth
  • REST API - Integrate into cluster schedulers and orchestration tools
  • 6 Beautiful Themes - Dracula, Solarized, Nord, Monokai, Gruvbox, and more
  • Free & Open Source - MIT licensed, no subscriptions

Crepi il lupo! 🐺