Rapid-MLX: the local AI engine 4.2x faster than Ollama on Apple Silicon
π The Mac just became the best machine for running an LLM
For years, local AI on Mac was a compromise. It worked, but it was slow. Ollama had simplified installation, llama.cpp provided the engine, but neither truly tapped into the raw power of Apple Silicon chips.
In June 2026, a GitHub repo by raullenchai changed the game. Rapid-MLX is an open-source inference engine that directly leverages Apple's MLX framework with native Metal compute kernels. The result: up to 4.2x faster than Ollama on the same machines, drop-in OpenAI API compatibility, and a single pip command installation.
The timing is perfect. Ollama 0.30 bascule sur llama.cpp : la rΓ©volution architecturale qui change le local AI shows that Ollama remains on the llama.cpp architecture, while MLX proves in benchmarks that it beats llama.cpp by up to 3x in throughput according to ModelFit. Two visions are clashing, and the numbers speak for themselves.
The essentials
- Rapid-MLX is an open-source LLM server specifically designed for Apple Silicon, using the MLX framework with native Metal kernels.
- Independent benchmarks measure between 2.6x and 4.2x speed gains compared to Ollama on real inference tasks.
- It is 100% compatible with the OpenAI API, allowing it to be instantly plugged into Cursor, Claude Code, Aider and any development tool.
- Installation comes down to a single pip command, no Docker, no heavy dependencies.
- Ollama 0.19 integrated an MLX backend in April 2026, but Rapid-MLX remains significantly faster because it has been designed natively for MLX from day one.
Recommended Tools
| Tool | Main Usage | Price (June 2026, check on site) | Ideal for |
|---|---|---|---|
| Rapid-MLX | Local LLM server on Mac | Free (open-source) | Mac developers looking for max performance |
| Ollama | Multi-platform LLM server | Free (open-source) | Ease of use, wide compatibility |
| Hostinger | Web hosting to deploy AI apps | Starting from 2.99β¬/month | Deploying interfaces around local LLMs |
Why MLX beats llama.cpp on Apple Silicon
The answer comes down to one word: integration. llama.cpp was designed for the CPU with a GPU backend added later. MLX is the opposite β Apple created this framework specifically for its chips.
The MLX architecture explained simply
MLX is a numerical computing framework developed by Apple's ML Research team. It was designed for the Unified Memory of M1/M2/M3/M4 chips. Unlike a PC where the CPU and GPU each have their own VRAM, a Mac shares all its RAM between the two.
Rapid-MLX leverages this architecture with Metal compute kernels written specifically for each inference operation. No intermediate translation, no generic abstraction layer. The computation goes directly from the model to the GPU cores via Metal.
According to the study MLX vs Ollama on Apple Silicon (2026) β Real Benchmarks published by WillItRunAI in April 2026, MLX uses about 10% less memory than Ollama for the same model and achieves 15 to 30% more throughput.
Up to 3x compared to llama.cpp
The ModelFit June 2026 comparison goes even further by showing that MLX beats llama.cpp by up to 3x on certain configurations. The difference is explained by kernel optimization: where llama.cpp uses generic CUDA operations translated to Metal, MLX has dedicated kernels for each type of transformer layer.
It's the difference between a driver who knows the route and a generic GPS. Both will take you to the same place, but the first one takes the shortcuts.
Benchmarks: 2.6x to 4.2x faster than Ollama
The figures come from independent sources, not from the project's creator. This is important for credibility.
Ship or Skip's verdict: 4.2x
Ship or Skip gave its "Ship" (recommended) verdict with a 4.2x factor measured on long generation tasks. The test focused on a real-world development scenario: code generation, file analysis, rapid iterations.
Awesome Agents confirms at 2.6x
The June 2026 Awesome Agents benchmark measures 2.6x faster than Ollama with 66 supported model aliases. The test includes time-to-first-token and sustained tokens-per-second metrics.
Andrew.ooo's review: the most comprehensive
Andrew.ooo's May 2026 review is probably the most honest analysis available. Andrew not only tests raw speeds, but also integration with Claude Code, stability over long sessions, and the project's real limitations.
His verdict: Rapid-MLX is indeed the fastest on Apple Silicon, but it still lacks maturity compared to Ollama in certain aspects like handling multiple GGUF models or the template system.
Benchmark summary table
| Source | Speed factor vs Ollama | Tested model | Machine |
|---|---|---|---|
| Ship or Skip | 4.2x | Long generation | Apple Silicon (unspecified) |
| Awesome Agents | 2.6x | 66 aliases | Apple Silicon M-series |
| WillItRunAI | 1.15-1.30x (MLX vs Ollama) | MLX backend | Apple Silicon |
| Andrew.ooo | 3-4x (generation) | Mixed | Mac M-series |
The variance between 1.3x and 4.2x is explained by the test scenarios. Raw throughput (continuous tokens/second) shows more modest differences. It is on time-to-first-token and short, frequent generations that the gap explodes β exactly the usage profile of a developer with an AI assistant.
Installation: a single pip command, that's it
This is undoubtedly the most striking point. No Docker, no binary to download, no complex installation script.
pip install rapid-mlx
The package is available on PyPI. Once installed, the server launches with a simple command and automatically exposes an OpenAI-compatible API on the local port.
Minimal configuration
After installation, you specify the model to load and the listening port. Rapid-MLX automatically downloads the weights from Hugging Face if necessary, converts them to the MLX format, and starts the server.
The default address is http://localhost:8000, with the classic endpoints: /v1/chat/completions, /v1/completions, /v1/models. Any OpenAI client can connect to it simply by changing the base URL.
Supported models
Rapid-MLX supports models in the native MLX format. According to the Awesome Agents benchmarks, 66 model aliases are recognized. For the Meilleurs Modeles Ollama you already know, most exist in an MLX version on Hugging Face.
Among the current open-source models in the reference list, the best suited for local use on Mac are compact models like Alibaba's Qwen3.6-27B (score 74) or Qwen3.5-27B (score 63), which fit comfortably into 32 GB of unified RAM. DeepSeek V4 Flash (High) with its score of 71 is also an excellent candidate for Mac, especially since antirez's ds4 engine was designed precisely to make it usable locally.
Integration with development tools
This is where OpenAI API compatibility really shines. You don't change your workflow, you just change the server URL.
Cursor, Aider, OpenCode
In Cursor, go to the settings, in the "Models" section, add a custom model with the base URL http://localhost:8000/v1. Cursor automatically detects the available models via the /v1/models endpoint.
With Aider, it's even simpler: aider --openai-api-base http://localhost:8000/v1. Aider lists the available models and you choose.
OpenCode and any tool based on the OpenAI Python or JavaScript SDK work the same way. It's plug-and-play.
Claude Code and the specific case of integration
La review d'Andrew.ooo specifically tests the integration with Claude Code. The operation is similar: Claude Code can be configured to point to a local endpoint instead of the Anthropic API.
The advantage is twofold: you keep Claude Code's interface and workflow, but you use a free local model. Network latency disappears, so do the costs, and your data never leaves the machine.
Prompt caching: the real hidden gain
Rapid-MLX implements prompt caching natively via MLX. When you send the same system context or the same files repeatedly (a classic scenario in development), the engine does not recalculate the embeddings every time.
This partly explains the 4.2x gap on the Ship or Skip benchmarks: in development, we often send the same context with different questions. Prompt caching turns these repetitive requests into massive gains.
Ollama 0.19 integrated MLX β why Rapid-MLX is even faster
This is the legitimate question everyone is asking. In April 2026, Ollama 0.19 integrated an MLX backend, doubling the speed on Apple Silicon. So why not just use Ollama with MLX?
Native architecture vs added backend
Ollama was built around llama.cpp. The integration of MLX in Ollama 0.19 is an additional backend, not a rewrite. Ollama's internal pipeline β model management, templating, request routing β remains that of llama.cpp with a translation to MLX at runtime.
Rapid-MLX was born on MLX. Every component of the pipeline is optimized for this architecture. No translation layer, no abstraction overhead.
The MLX vs Ollama comparison by WillItRunAI
WillItRunAI's April 2026 study specifically compares pure MLX (via Rapid-MLX) to Ollama 0.19 with the MLX backend. The result: pure MLX maintains a 15 to 30% advantage in throughput and 10% in memory consumption.
The gap widens on short generations and frequent calls β exactly the use case for AI-assisted development.
When to choose Ollama over Rapid-MLX
Rapid-MLX is not the right choice in every scenario. If you need Ollama's vast library of GGUF models, its ecosystem of managers (Ollama WebUI, etc.), or if you are working on Linux or Windows, Ollama remains the logical choice.
Rapid-MLX is a Mac specialist tool. It shines when you want to extract maximum performance from your Apple Silicon for daily development.
The full comparison Running LLMs Locally on macOS: The Complete 2026 Comparison from Dev.to (March 2026) clearly positions the tools: LM Studio for the graphical interface, Ollama for multi-platform simplicity, and Rapid-MLX for pure performance on Mac.
Real-world use cases on Mac
Local assisted development with Qwen3.6-27B
Alibaba's Qwen3.6-27B (score 74) is the sweet spot for a Mac with 32 GB of RAM. It offers solid code performance while leaving enough memory for your IDE, browser, and system.
With Rapid-MLX, you get near-instant responses in Cursor or Aider. The time-to-first-token drops below the 200ms mark in most cases, making the experience identical to a cloud API call.
Code analysis with DeepSeek V4 Flash
DeepSeek V4 Flash (High) with its score of 71 is built for speed. Combined with Rapid-MLX, it becomes a formidable codebase analysis tool. You can send it entire files, request code reviews, refactoring suggestions β all locally and without network latency.
Local AI agent with tool calling
Rapid-MLX supports tool calling via the OpenAI API. This means you can build agents that read your files, execute shell commands, interact with your development API β all locally.
It's the same pattern as GPT-4 or Claude-based agents, but without recurring costs and without sending your source code to a third party.
β Common mistakes
Mistake 1: Choosing a model too large for your RAM
Rapid-MLX is fast, but it can't create memory. A 70B parameter model in Q4 quantization requires about 40 GB of RAM. If your Mac has 32 GB, it won't work, even with the best engine in the world.
Solution: start with models under 30B parameters. Qwen3.6-27B and DeepSeek V4 Flash are ideal candidates for 32 GB. Check out our Meilleurs Modeles Ollama guide to refine your choice based on your config.
Mistake 2: Comparing benchmarks on different machines
A benchmark on an M1 Max 64 GB has nothing to do with an M2 Air 16 GB. Speed factors (2.6x, 4.2x) are measured on the same machine, but the absolute throughput depends entirely on your chip and your RAM.
Solution: read benchmarks to understand the order of magnitude, but test on your own machine. Installation takes 2 minutes.
Mistake 3: Ignoring prompt caching in evaluation
If you test Rapid-MLX with unique prompts every time, you won't see the maximum benefit. Prompt caching is a structural advantage that manifests in real work sessions, not in synthetic benchmarks.
Solution: test in real conditions β same system context, repeatedly sent files, chained questions. That's where the gap with Ollama explodes.
Mistake 4: Using Rapid-MLX in production without fallback
Rapid-MLX is a young project. It can crash, it has edge cases, it doesn't handle all model formats yet. Using it as the sole backend in production without a plan B is risky.
Solution: keep Ollama or a cloud API as a fallback. The OpenAI API compatibility makes the switchover trivial β you just need to change the base URL.
β Frequently Asked Questions
Does Rapid-MLX work on Intel Mac?
No. Rapid-MLX uses the Metal compute kernels from the MLX framework, which are exclusive to Apple Silicon chips (M1 and above). On Intel Macs, use Ollama with llama.cpp.
What is the advantage over LM Studio?
LM Studio offers a graphical interface and runs on llama.cpp. Rapid-MLX is a headless server optimized for MLX, significantly faster on Apple Silicon. For a detailed comparison, check out our page on Ollama vs LM Studio.
Can I use Rapid-MLX with GGUF models?
Not directly. Rapid-MLX expects models in the native MLX format. Conversion from GGUF is possible but adds an extra step. MLX models are available on Hugging Face for most popular LLMs.
Will Ollama catch up to Rapid-MLX?
Ollama 0.19 has already integrated MLX as a backend, but its internal architecture remains designed around llama.cpp. The 15-30% gap measured by WillItRunAI suggests that catching up to a native MLX engine will be difficult without a major rewrite.
How much RAM do you need to get started?
16 GB is enough for models up to 7-8B parameters in Q4. To fully leverage Rapid-MLX with models like Qwen3.6-27B or DeepSeek V4 Flash, 32 GB is the recommended minimum.
β Conclusion
Rapid-MLX doesn't replace Ollama β it surpasses it on Apple Silicon by leveraging what llama.cpp cannot: a native MLX architecture without a translation layer. With benchmarks ranging from 2.6x to 4.2x, a one-line pip installation, and OpenAI API compatibility that integrates everywhere, it has become the obvious choice for any Mac developer doing local AI on a daily basis. To set up your complete environment, follow our local LLM installation guide.